treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Support Athena with Delta tables #6351

Open ozkatz opened 1 year ago

ozkatz commented 1 year ago

Currently, using Athena with lakeFS works by registering symlinks in Glue.

For Delta tables, this won't work (or worse: it will cause deleted Parquet files to be queried as well).

For Delta, we should either generate symlinks based on the Delta log or find another way to query lakeFS from Athena.
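To illustrate why a plain listing of Parquet files is not enough, here is a minimal sketch of replaying the Delta log to find the files that are still live. It assumes only JSON commit files (no checkpoints) and that each commit's text is already in memory; the function name `live_files` and the sample file names are hypothetical and not part of lakeFS.

```python
import json


def live_files(commits):
    """Replay Delta commits (oldest first) and return the set of live data files.

    `commits` is an iterable of strings, each holding the text of one
    _delta_log/<version>.json file (one JSON action per line).
    Assumption: checkpoints are ignored; a real reader must handle them.
    """
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                # The Parquet file may still exist in storage,
                # but it is no longer part of the table.
                live.discard(action["remove"]["path"])
    return live


# Hypothetical two-commit history: the second commit removes part-0000.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet", "partitionValues": {}}}),
    json.dumps({"add": {"path": "part-0001.parquet", "partitionValues": {}}}),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet", "partitionValues": {}}}),
])

print(sorted(live_files([commit_0, commit_1])))
# ['part-0001.parquet', 'part-0002.parquet']
```

A symlink export that lists every Parquet object under the table prefix would still include `part-0000.parquet`, which is exactly the "deleted files get queried" problem.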

kesarwam commented 1 year ago

Yes, it queries deleted files also

ozkatz commented 1 year ago

Some additional context:

Blindly exporting symlinks is a bad idea for open table formats. Symlinks work with Hive-style tables, so any Parquet file that appears in the symlink will be queried, and partition information might be lost. So that's a no-go.

An almost viable option was to export a "shallow clone" of the delta table: as a post-hook, write a single file, named <ref id>/<table name>/_delta_log/00000000000000000000.json. Inside it, specify the schema, metadata and list of files from the latest calculated snapshot of the table we're exporting - but use absolute URIs to point to the physical addresses of the files making up the given table.
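A rough sketch of what building that single commit file could look like, assuming the snapshot (schema string, partition columns, and live files with their physical addresses) has already been computed. The action fields follow the Delta transaction log protocol; the function names, the bucket, and the sample values are hypothetical, and this is not an actual lakeFS hook.

```python
import json


def shallow_clone_key(ref_id, table_name):
    """Object key of the single commit file, zero-padded to 20 digits."""
    return f"{ref_id}/{table_name}/_delta_log/{0:020d}.json"


def shallow_clone_commit(table_id, schema_string, partition_columns, files):
    """Return the text of a version-0 Delta commit describing the snapshot.

    `files` is a list of dicts with an absolute 'uri', a 'size' in bytes and
    'partition_values'. Protocol versions here are illustrative assumptions.
    """
    actions = [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": table_id,
            "format": {"provider": "parquet", "options": {}},
            "schemaString": schema_string,
            "partitionColumns": partition_columns,
            "configuration": {},
        }},
    ]
    for f in files:
        actions.append({"add": {
            "path": f["uri"],  # absolute URI to the physical address
            "partitionValues": f["partition_values"],
            "size": f["size"],
            "modificationTime": f.get("modification_time", 0),
            "dataChange": True,
        }})
    # Delta commits are newline-delimited JSON, one action per line.
    return "\n".join(json.dumps(a) for a in actions) + "\n"


schema = json.dumps({"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": True, "metadata": {}},
]})
commit = shallow_clone_commit(
    table_id="00000000-0000-0000-0000-000000000000",  # placeholder id
    schema_string=schema,
    partition_columns=[],
    files=[{"uri": "s3://example-bucket/data/abc123", "size": 1024,
            "partition_values": {}}],
)
print(shallow_clone_key("main", "my_table"))
print(commit)
```

The resulting text would be written to the key returned by `shallow_clone_key`; whether a given engine accepts absolute URIs in `add.path` is the open question discussed in this thread.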

I tested this (without lakeFS) by constructing a JSON file that points to arbitrary s3://... paths on the same bucket but in another directory. This works well for Unity! It might also work well for Spark on Glue (haven't tried), but for Athena I'm hitting this wall: https://github.com/trinodb/trino/issues/17011 - I see exactly this behavior on Athena (v3). It seems like this fix: https://github.com/trinodb/trino/pull/17038 (once it is merged, released in Trino, picked up by Athena, and made available) will solve it, but it might take quite a while to get there.

Other options that might work:

  1. Reading the Delta snapshot and exporting symlinks in a way that preserves partitioning information and only includes "live" Parquet files (essentially, exporting the Delta table as a Hive table).
  2. Doing the same as above, but exporting as Iceberg instead of symlinks; Iceberg should default to absolute paths anyway.
  3. Using Athena's federated querying abilities. This has a few downsides. It requires quite a bit of development work: the connector would be specific to Delta on lakeFS and would have to implement parts of Delta such as predicate pushdown and other low-level capabilities. The other downside is the operational cost for the user: having to install Lambda functions from the AWS marketplace, set up IAM for them, etc.

Not a fan of any of the above :)
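For reference, option 1 could be sketched roughly as follows: take the live `add` actions from the Delta snapshot and group them into one symlink manifest per partition. The `<col>=<value>/manifest` layout is modeled on Delta's own symlink manifest output and is an assumption here, as are the function name and sample paths; escaping of partition values and null partitions are not handled.

```python
from collections import defaultdict


def symlink_manifests(live_adds, partition_columns):
    """Group live files by partition and build one manifest per partition.

    `live_adds` is a list of Delta `add` actions whose 'path' is already an
    absolute URI. Returns {relative manifest path: manifest text}.
    """
    grouped = defaultdict(list)
    for add in live_adds:
        values = add["partitionValues"]
        # Hive-style partition directory, e.g. "year=2023/month=01".
        prefix = "/".join(f"{col}={values[col]}" for col in partition_columns)
        grouped[prefix].append(add["path"])
    return {
        (f"{prefix}/manifest" if prefix else "manifest"):
            "\n".join(sorted(paths)) + "\n"
        for prefix, paths in grouped.items()
    }


# Hypothetical live files of a table partitioned by "year".
adds = [
    {"path": "s3://example-bucket/data/a", "partitionValues": {"year": "2022"}},
    {"path": "s3://example-bucket/data/b", "partitionValues": {"year": "2023"}},
    {"path": "s3://example-bucket/data/c", "partitionValues": {"year": "2023"}},
]
for path, text in sorted(symlink_manifests(adds, ["year"]).items()):
    print(path)
    print(text, end="")
```

Each partition directory would then be registered in Glue as a partition of a symlink-based table, so the partition values come from the catalog rather than from the physical paths.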

github-actions[bot] commented 10 months ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions[bot] commented 9 months ago

Closing this issue because it has been stale for 7 days with no activity.