Improve default Iceberg table properties
Explicitly collect column statistics for the most important timestamp columns
Enable the object storage location provider, which adds a hash component to file paths
Make icebergTableProperties configurable, so users can override the defaults.
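As an illustrative sketch, these defaults could be expressed with Iceberg's documented table properties. The property names below come from Iceberg's table-properties reference; the `icebergTableProperties` config shape and exact keys the loader sets are assumptions:

```hocon
"output": {
  "good": {
    # Assumed config key; user-supplied entries would be merged over the defaults
    "icebergTableProperties": {
      # Collect full column statistics for the most important timestamp columns
      "write.metadata.metrics.column.load_tstamp": "full"
      "write.metadata.metrics.column.collector_tstamp": "full"
      # Object storage location provider: adds a hash component to data file
      # paths, spreading writes across object storage prefixes
      "write.object-storage.enabled": "true"
    }
  }
}
```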
Bad rows must not exceed the maximum size allowed by the sink
Trap SIGTERM and start graceful shutdown
This loader tries to shut down gracefully. Unfortunately, some third-party libraries begin their own graceful shutdown as soon as the JVM starts to shut down. This is bad for the loader because it interferes with our delayed graceful shutdown.
This commit traps SIGTERM so that we explicitly start our own graceful shutdown before the JVM (and therefore the third-party libraries) begins its own.
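As a minimal sketch (not the loader's actual code; the class and method names here are invented), trapping SIGTERM on the JVM can be done with `sun.misc.Signal`, which replaces the default handler that would otherwise begin JVM shutdown immediately:

```java
import sun.misc.Signal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: trap SIGTERM so the application decides when the
// JVM (and any third-party shutdown hooks) starts shutting down.
class GracefulShutdown {
    static final CountDownLatch sigterm = new CountDownLatch(1);

    public static void install() {
        // Installing a handler for TERM replaces the JVM default, which
        // would immediately begin shutdown and run third-party hooks.
        Signal.handle(new Signal("TERM"), sig -> sigterm.countDown());
    }

    // Block until SIGTERM arrives (or the timeout expires); the caller
    // can then run its own delayed graceful shutdown and exit.
    public static boolean awaitSigterm(long millis) {
        try {
            return sigterm.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the handler only releases a latch, the main thread stays in control of the shutdown sequence instead of racing third-party shutdown hooks.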
Improve default Hudi configuration settings
I have recently tested this loader with Hudi at high event volume. I believe I have found a good combination of Hudi configuration options that work well with this loader.
Selected highlights:
BULK_INSERT is the best write operation to use with this loader. It is more compatible with how we share the local spark context across "write" tasks and "transform" tasks.
Clustering can be enabled safely if settings are chosen carefully. This lets Hudi compact small parquet files into larger ones. The chosen settings keep memory requirements reasonable without impacting latency.
Hudi parallelism settings should be turned down to 1. This is more compatible with how this loader uses a small local spark context, which is shared across "write" tasks and "transform" tasks.
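The highlights above could be sketched with Hudi's documented write options. The keys below are real Hudi options; the exact defaults the loader ships, and the config key that carries them, are assumptions:

```hocon
"hudiWriteOptions": {
  # BULK_INSERT plays best with the shared local spark context
  "hoodie.datasource.write.operation": "bulk_insert"
  # Inline clustering folds small parquet files into larger ones
  "hoodie.clustering.inline": "true"
  # Turn parallelism down to 1 for the small local spark context
  "hoodie.bulkinsert.shuffle.parallelism": "1"
  "hoodie.upsert.shuffle.parallelism": "1"
}
```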
Enable syncing Hudi metadata to the Glue Catalog
Hudi has a feature in which it syncs the table's schema and partitions to the Glue catalog. This is helpful for users who want to query a Hudi table via AWS Athena.
This commit adds a missing dependency and config settings so the Hudi/Glue sync now works with this loader, if configured.
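For illustration, enabling the sync via Hudi's documented meta-sync options might look like this (the exact keys the loader wires up are an assumption; the database and table names are placeholders):

```hocon
"hudiWriteOptions": {
  # Enable meta sync and point it at Hudi's Glue sync tool
  "hoodie.datasource.meta.sync.enable": "true"
  "hoodie.meta.sync.client.tool.class": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"
  # Placeholder names for the Glue catalog entries
  "hoodie.datasource.hive_sync.database": "snowplow_db"
  "hoodie.datasource.hive_sync.table": "events"
}
```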
Turn on metrics logging for Iceberg
By changing the log level of LoggingMetricsReporter to INFO, we get an appropriate and helpful level of information in the logs when writing to the Iceberg format.
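If logging is configured with Logback, the equivalent change is a one-line logger override. The logger name matches Iceberg's `org.apache.iceberg.metrics.LoggingMetricsReporter` class; that the loader applies it via a bundled logback.xml is an assumption:

```xml
<!-- Surface Iceberg's commit and scan metrics reports in the logs -->
<logger name="org.apache.iceberg.metrics.LoggingMetricsReporter" level="INFO"/>
```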
Improve default Delta table properties
This commit changes the default table properties when creating a new table. The changes are only relevant when running the loader for the first time. The new defaults are based on our recent experience of loading to Delta with high event volume.
delta.logRetentionDuration. This affects users who periodically run a compaction job on their Delta table. Previously we kept old log files for 30 days beyond the compaction job. Reducing this to 1 day cuts down the number of log files Delta must manage on high-volume pipelines.
delta.dataSkippingStatsColumns. We want Delta to collect stats on the columns load_tstamp, collector_tstamp, derived_tstamp and dvce_created_tstamp. Previously we achieved this by moving those four columns to the far left of the table and setting delta.dataSkippingNumIndexedCols = 4. Delta has a new option, dataSkippingStatsColumns, where we can explicitly name the columns to index. This is better for the end user, because they can alter the table to add any custom column to the list.
delta.checkpointInterval. By default, Delta creates a checkpoint every 10 commits. Because the Lake Loader commits frequently, and because it scales horizontally to multiple loaders, we have found improved efficiency by decreasing how often it writes a checkpoint.
This commit also sets the spark option spark.databricks.delta.autoCompact.enabled to false. This is only needed in case the customer ever manually sets the table property delta.autoOptimize.autoCompact: it is important we override the table property.
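Putting the above together, the new defaults might look roughly like this. The property names are Delta's documented table properties; the concrete checkpoint interval and the surrounding config keys are illustrative assumptions:

```hocon
"deltaTableProperties": {
  # Keep old log files for 1 day (was 30) beyond compaction
  "delta.logRetentionDuration": "interval 1 days"
  # Name the stats columns explicitly instead of relying on column order
  "delta.dataSkippingStatsColumns": "load_tstamp,collector_tstamp,derived_tstamp,dvce_created_tstamp"
  # Checkpoint less often than the default of every 10 commits
  # (illustrative value)
  "delta.checkpointInterval": "50"
}
"spark": {
  "conf": {
    # Override any user-set delta.autoOptimize.autoCompact table property
    "spark.databricks.delta.autoCompact.enabled": "false"
  }
}
```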
Bump common-streams and iceberg to latest versions
In #60 I bumped common-streams to 0.7.0-M2. This commit consolidates on the final 0.7.0 release.
In #59 I bumped iceberg to 1.5.1. But the Iceberg release notes strongly encourage updating to 1.5.2.
Jira ref: PDP-1221