snowplow-incubator / snowplow-lake-loader

Snowplow Lake Loader

Various features for 0.4.0 #62

Closed by istreeter 3 months ago

istreeter commented 3 months ago

Jira ref: PDP-1221

Improve default Iceberg table properties


Bad rows must not exceed the maximum size allowed by the sink


Trap SIGTERM and start graceful shutdown

This loader attempts a graceful shutdown. Unfortunately, some third-party libraries start their own graceful shutdown as soon as the JVM begins to shut down. This is bad for the loader because it interferes with our delayed graceful shutdown.

This commit traps SIGTERM so that we explicitly start our own graceful shutdown before the JVM (and therefore the third-party libraries) starts its own shutdown.
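The idea can be sketched in plain Java. This is only an illustration, not the loader's actual code: the class `SigtermTrap` and its `drain` callback are hypothetical names, and it relies on the JDK-internal `sun.misc.Signal` API (in the `jdk.unsupported` module), which compiles with a warning. Installing a handler for `TERM` replaces the JVM's default handler, so the JVM does not begin its own shutdown (and run third-party shutdown hooks) until our drain step has finished.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import sun.misc.Signal; // JDK-internal API; javac prints a warning

/** Hypothetical helper: runs our own drain step when SIGTERM arrives. */
final class SigtermTrap {
    private final AtomicBoolean started = new AtomicBoolean(false);
    private final CountDownLatch finished = new CountDownLatch(1);
    private final Runnable drain;

    SigtermTrap(Runnable drain) {
        this.drain = drain;
    }

    /** Replaces the JVM's default SIGTERM handler with our own. */
    void install() {
        Signal.handle(new Signal("TERM"), signal -> beginShutdown());
    }

    /** Runs the drain step at most once; returns true only for the call that ran it. */
    boolean beginShutdown() {
        if (!started.compareAndSet(false, true)) {
            return false;
        }
        try {
            drain.run();
        } finally {
            finished.countDown();
        }
        return true;
    }

    /** Blocks until the drain step has completed or the timeout elapses. */
    boolean awaitFinished(long timeout, TimeUnit unit) throws InterruptedException {
        return finished.await(timeout, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        SigtermTrap trap = new SigtermTrap(() -> System.out.println("flushing in-flight events"));
        trap.install();
        // Simulate `kill -TERM <pid>` for the demo; the handler runs on its own thread.
        Signal.raise(new Signal("TERM"));
        boolean done = trap.awaitFinished(5, TimeUnit.SECONDS);
        System.out.println("graceful shutdown finished: " + done);
        // Returning from main lets the JVM exit; only then do shutdown hooks run.
    }
}
```

The `AtomicBoolean` guard makes the drain step idempotent, so a second SIGTERM during shutdown does not start it again.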


Improve default Hudi configuration settings

I recently tested this loader with Hudi at high event volume. I believe I have found a combination of Hudi configuration options that works well with this loader.

Selected highlights:


Enable syncing Hudi metadata to the Glue Catalog

Hudi can sync the table's schema and partitions to the Glue Catalog. This is helpful for users who want to query a Hudi table via AWS Athena.

This commit adds a missing dependency and config settings so the Hudi/Glue sync now works with this loader, if configured.
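For orientation, the kind of Hudi write options involved in a Glue sync might be assembled as below. This is a sketch under assumptions: the helper `HudiGlueSyncOptions` is hypothetical, the database and table names are placeholders, and the exact keys this loader sets are not stated in this PR. The option keys and the `AwsGlueCatalogSyncTool` class name are taken from Hudi's own configuration; the sync tool ships in Hudi's AWS module, which is presumably the missing dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles Hudi write options for Glue sync. */
final class HudiGlueSyncOptions {

    static Map<String, String> build(boolean enabled, String database, String table) {
        Map<String, String> opts = new LinkedHashMap<>();
        // Master switch for syncing table metadata after each commit.
        opts.put("hoodie.datasource.hive_sync.enable", Boolean.toString(enabled));
        if (enabled) {
            // Use the Glue sync tool rather than the default Hive one.
            opts.put("hoodie.meta.sync.client.tool.class",
                     "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool");
            opts.put("hoodie.datasource.hive_sync.database", database);
            opts.put("hoodie.datasource.hive_sync.table", table);
        }
        return opts;
    }

    public static void main(String[] args) {
        // "snowplow" and "events" are placeholder names for this demo.
        build(true, "snowplow", "events")
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```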


Turn on metrics logging for Iceberg

Setting the log level of LoggingMetricsReporter to info gives us an appropriate and helpful level of information in the logs when writing to Iceberg format.


Improve default Delta table properties

This commit changes the default table properties used when creating a new table. The changes are only relevant when running the loader for the first time. The new defaults are based on our recent experience of loading to Delta with high event volume.

This commit also sets the Spark option spark.databricks.delta.autoCompact.enabled to false. This is only needed in case the customer ever manually sets the table property delta.autoOptimize.autoCompact: it is important that the Spark option overrides the table property.
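The precedence at play can be modelled in a few lines. This is a simplified illustration, not Delta's implementation: the class `AutoCompactPrecedence` is hypothetical, and it treats both settings as plain booleans (Delta also accepts other values for these settings, which this model ignores). It shows why an explicit session-level `false` keeps auto compaction off even when the table property is set to `true`.

```java
import java.util.Map;

/** Hypothetical, simplified model of how the two auto-compact settings combine. */
final class AutoCompactPrecedence {
    static final String SESSION_KEY = "spark.databricks.delta.autoCompact.enabled";
    static final String TABLE_KEY = "delta.autoOptimize.autoCompact";

    /** The session option wins when present; otherwise the table property applies. */
    static boolean effective(Map<String, String> sessionConf, Map<String, String> tableProps) {
        String fromTable = tableProps.getOrDefault(TABLE_KEY, "false");
        String value = sessionConf.getOrDefault(SESSION_KEY, fromTable);
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> tableProps = Map.of(TABLE_KEY, "true");
        // Without the session option, the manually set table property takes effect.
        System.out.println(effective(Map.of(), tableProps));
        // With the session option set to false, the table property is overridden.
        System.out.println(effective(Map.of(SESSION_KEY, "false"), tableProps));
    }
}
```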


Bump common-streams and iceberg to latest versions

In #60 I bumped common-streams to 0.7.0-M2. This consolidates the version on 0.7.0.

In #59 I bumped Iceberg to 1.5.1, but the Iceberg release notes strongly encourage updating to 1.5.2.