Open ozkatz opened 1 year ago
Yes, it queries deleted files also
Some additional context:
Blindly exporting symlinks is a bad idea for open table formats: symlinks work with Hive-style tables, so any Parquet file appearing in the symlink will be queried, and partition information might be lost. So that's a no-go.
An almost viable option was to export a "shallow clone" of the Delta table: as a post-hook, write a single file named <ref id>/<table name>/_delta_log/00000000000000000000.json. Inside it, specify the schema, metadata and list of files from the latest calculated snapshot of the table we're exporting, but use absolute URIs to point to the physical addresses of the files making up the given table.
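To make the shallow-clone idea concrete, here is a rough sketch of how such a version-0 commit file could be assembled. It only builds the file contents as newline-delimited JSON actions (`protocol`, `metaData`, then one `add` per data file). The function name `shallow_clone_commit`, the `uri`/`size`/`partition_values` input keys and the protocol versions chosen are assumptions for illustration, not what lakeFS actually does.

```python
import json
import time
import uuid


def shallow_clone_commit(schema_string, files, partition_columns=()):
    """Return the text of a single-commit Delta log (version 0).

    `files` is an iterable of dicts with "uri" (absolute URI), "size"
    (bytes) and an optional "partition_values" dict. These key names are
    hypothetical and exist only for this sketch.
    """
    now_ms = int(time.time() * 1000)
    actions = [
        # Minimal reader/writer versions; assumed, adjust to the source table.
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": str(uuid.uuid4()),
            "format": {"provider": "parquet", "options": {}},
            "schemaString": schema_string,
            "partitionColumns": list(partition_columns),
            "configuration": {},
            "createdTime": now_ms,
        }},
    ]
    for f in files:
        # The "path" is an absolute URI pointing at the physical object,
        # rather than a path relative to the table root.
        actions.append({"add": {
            "path": f["uri"],
            "partitionValues": f.get("partition_values", {}),
            "size": f["size"],
            "modificationTime": now_ms,
            "dataChange": True,
        }})
    # A Delta commit file holds one JSON action per line.
    return "\n".join(json.dumps(a) for a in actions) + "\n"
```

The returned string would then be written to `<ref id>/<table name>/_delta_log/00000000000000000000.json`. The schema string has to be the table's real Delta schema JSON, taken from the snapshot being exported.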
I tested this (without lakeFS) by constructing a JSON file that points to arbitrary s3://... paths on the same bucket but in another directory. This works well for Unity! It might also work well for Spark on Glue (haven't tried), but for Athena I'm hitting this wall: https://github.com/trinodb/trino/issues/17011 - I see exactly this behavior on Athena (v3).
It seems like this fix: https://github.com/trinodb/trino/pull/17038 (once it is merged, released in Trino, picked up by Athena and made available) will solve it, but it might take quite a while to get there.
Other options that might work:
Not a fan of any of the above :)
This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.
Closing this issue because it has been stale for 7 days with no activity.
Currently, using Athena with lakeFS works by registering symlinks into Glue.
For Delta tables, this won't work (or worse: will cause deleted Parquet files to also be queried).
For Delta, we should either generate symlinks based on the Delta log, or find another way to query lakeFS from Athena.
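As a rough illustration of the "generate symlinks based on the Delta log" option: replay the `add` and `remove` actions of the JSON commits in order, and list only the files that are still live. The sketch below assumes an unpartitioned table, ignores checkpoint files, and takes the commit texts as already-loaded strings. The helper names `live_files` and `symlink_manifest` are made up for this example.

```python
import json


def live_files(commits):
    """Replay Delta commit texts (oldest first) and return live file paths.

    Each element of `commits` is the content of one _delta_log/*.json file.
    Checkpoints are not handled in this sketch.
    """
    live = {}  # dict used as an insertion-ordered set
    for text in commits:
        for line in text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live[action["add"]["path"]] = True
            elif "remove" in action:
                # A removed file must no longer be visible to queries.
                live.pop(action["remove"]["path"], None)
    return list(live)


def symlink_manifest(table_root, paths):
    """Render a symlink manifest: one absolute data file URI per line."""
    root = table_root.rstrip("/")
    # Relative paths are resolved against the table root; absolute URIs are kept.
    lines = [p if "://" in p else f"{root}/{p}" for p in paths]
    return "\n".join(lines) + "\n"
```

Because removed files are dropped during the replay, a manifest produced this way would not expose deleted Parquet files. A partitioned table would need one manifest per partition directory, which this sketch does not attempt.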