Version 0.1.0 of the Lake Loader only properly supported Delta as an output format. Future versions will support more output formats and catalog types.
To support more formats and catalogs, we will need to add more third-party runtime libraries to the Docker image. In some cases, these dependency bundles are annoyingly large and contain lots of shaded transitive dependencies on yet more third-party libraries. Shaded transitive dependencies are a problem when managing CVEs: the shaded library's classes are copied into the bundle jar itself, so we cannot bump its version via a configuration change in our Dependencies file.
For example, to enable Hudi as an output format, we need to add the hudi-spark3.4-bundle. This single jar file contains 23546 classes (that's huge!), including an old version of jackson-databind with known CVEs and an old version of jetty-http with known CVEs.
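For anyone who wants to check the size of a bundle like this themselves, here is a rough sketch of how the classes in a jar could be counted and grouped by package prefix, which also makes shaded libraries easier to spot. It only uses Python's standard zipfile module. The function name, the depth parameter, and the entries in the demo jar are made up for illustration; the demo entries do not reflect the real layout of the Hudi bundle.

```python
import io
import zipfile
from collections import Counter


def count_classes_by_package(jar, depth=2):
    """Count .class entries in a jar, grouped by the first `depth` package segments."""
    counts = Counter()
    with zipfile.ZipFile(jar) as zf:
        for name in zf.namelist():
            if not name.endswith(".class"):
                continue  # skip manifests, resources, directories
            segments = name.split("/")[:-1]  # drop the class file name itself
            key = "/".join(segments[:depth]) or "(default package)"
            counts[key] += 1
    return counts


def build_demo_jar():
    """Build a tiny in-memory jar so the sketch runs without a real bundle.

    All entry names below are hypothetical.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("org/apache/hudi/HoodieWriter.class", b"")
        zf.writestr("org/apache/hudi/common/Util.class", b"")
        # A relocated (shaded) third-party class living under the bundle's own package.
        zf.writestr("org/apache/hudi/com/fasterxml/jackson/databind/ObjectMapper.class", b"")
        zf.writestr("META-INF/MANIFEST.MF", b"Manifest-Version: 1.0\n")
    buf.seek(0)
    return buf


counts = count_classes_by_package(build_demo_jar(), depth=4)
total = sum(counts.values())
for package, n in counts.most_common():
    print(f"{n:6d}  {package}")
print(f"{total:6d}  total classes")
```

To run this against a real bundle, pass the path of the jar file to count_classes_by_package instead of the in-memory demo jar.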
I don't want to add these mega-bundles into our main docker images. These dependencies are not needed at all for output formats like Delta and Iceberg (for some catalog types). Users of those formats will not thank us for adding huge runtime dependencies that would get flagged by CVE-scanning tools.
I propose separating the Docker builds and using different tags to distinguish them. For example, the next release of Lake Loader will include:
snowplow/lake-loader-gcp:0.2.0 for Delta output and for Iceberg output with a limited choice of catalogs.
snowplow/lake-loader-gcp:0.2.0-hudi which additionally contains the large Hudi bundle.
snowplow/lake-loader-gcp:0.2.0-biglake which supports the Iceberg output format using the BigLake catalog.
(...and repeated for aws/azure flavours of the loader)
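To make the proposed tagging scheme concrete, here is a small sketch of how a deployment script might pick the image for a given output format and catalog. This is only an illustration of the naming convention above, not part of the loader: the function name, the format and catalog strings, and the choice not to restrict the biglake variant to GCP are all assumptions.

```python
CLOUDS = ("aws", "gcp", "azure")


def image_tag(cloud, version, output_format, catalog=None):
    """Return the Docker image for a loader flavour, following the proposed tags.

    The `output_format` and `catalog` strings are hypothetical identifiers,
    not the loader's actual configuration keys.
    """
    if cloud not in CLOUDS:
        raise ValueError(f"unknown cloud: {cloud!r}")
    base = f"snowplow/lake-loader-{cloud}:{version}"
    fmt = output_format.lower()
    if fmt == "hudi":
        # The Hudi bundle only ships in the dedicated -hudi image.
        return f"{base}-hudi"
    if fmt == "iceberg" and catalog == "biglake":
        # The BigLake catalog dependencies only ship in the -biglake image.
        return f"{base}-biglake"
    if fmt in ("delta", "iceberg"):
        # The main image covers Delta, and Iceberg with the other catalogs.
        return base
    raise ValueError(f"unsupported output format: {output_format!r}")


print(image_tag("gcp", "0.2.0", "delta"))
print(image_tag("gcp", "0.2.0", "hudi"))
print(image_tag("gcp", "0.2.0", "iceberg", catalog="biglake"))
```

The point of the scheme is that users of the main image never pull the large bundles, while the suffixed tags opt in to them explicitly.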