snowplow-incubator / snowplow-lake-loader

Snowplow Lake Loader

Separate docker builds for minority lake formats #40

Closed istreeter closed 9 months ago

istreeter commented 10 months ago

Version 0.1.0 of the Lake Loader properly supported only Delta as an output format. Future versions will support more output formats and catalog types.

To support more formats and catalogs, we will need to add more third-party runtime libraries to the docker image. In some cases, these dependency bundles are annoyingly large and contain lots of (shaded) transitive dependencies on yet more third-party libs. Shaded transitive dependencies are a problem when managing CVEs, because the shaded classes are baked into the bundle jar, so we cannot bump the version of the shaded lib via a configuration change in our Dependencies file.

For example, to enable Hudi as an output format, we need to add the hudi-spark3.4-bundle. This single jar file contains 23,546 classes (that's huge!), including an old version of jackson-databind with known CVEs and an old version of jetty-http with known CVEs.
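For anyone who wants to check a bundle themselves, here is a rough sketch of how the class count and the bundled libraries could be inspected. It is not part of the loader; the function names and the jar path passed on the command line are made up for illustration. It relies only on the fact that a jar is a zip archive, and it matches on the standard package paths of jackson-databind and jetty-http as substrings, so it should also catch copies that were relocated under a shading prefix.

```python
import sys
import zipfile


def count_classes(jar_path):
    """Return the number of .class entries in a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return sum(1 for name in jar.namelist() if name.endswith(".class"))


def find_bundled_packages(jar_path, markers):
    """Count .class entries whose path contains each marker substring.

    Substring matching (rather than prefix matching) is deliberate: a shaded
    copy usually keeps the original package path but sits under an extra
    relocation prefix.
    """
    counts = {marker: 0 for marker in markers}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            for marker in markers:
                if marker in name:
                    counts[marker] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python inspect_jar.py path/to/some-bundle.jar
    path = sys.argv[1]
    print("classes:", count_classes(path))
    print(find_bundled_packages(path, [
        "com/fasterxml/jackson/databind/",  # jackson-databind package path
        "org/eclipse/jetty/http/",          # jetty-http package path
    ]))
```

A non-zero count for a marker only says that classes from that library are inside the jar. It does not tell you which version was bundled, so a CVE scanner is still needed for that.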

I don't want to add these mega-bundles into our main docker images. These dependencies are not needed at all for output formats like Delta and Iceberg (for some catalog types). Users of those formats will not thank us for adding huge runtime dependencies that would get flagged by CVE-scanning tools.

I propose to separate the docker builds and use different tags to distinguish them. For example, the next release of Lake Loader will include:

(...and repeated for aws/azure flavours of the loader)