projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

iceberg tables "all-meta-tables" corrupted after nessie-gc #8263

Open KingLommel opened 7 months ago

KingLommel commented 7 months ago

Issue description

What is the problem:

After a successful nessie-gc run, the Iceberg metadata tables described in https://iceberg.apache.org/docs/nightly/spark-queries/#all-metadata-tables are corrupted.

What did I do:

  1. I created a new Iceberg table :white_check_mark:
  2. I ran nessie-gc to clean up orphaned files :white_check_mark:
  3. I checked whether nessie-gc had done its job by verifying that the number of .avro and .json files in my S3 bucket had been reduced :white_check_mark: I also looked into the Postgres database "nessie_gc" and checked its tables for changes. :white_check_mark:
  4. I read the files with Spark and showed the table content :white_check_mark:
  5. Since the table is an Iceberg table, I should be able to read `<table-name>.all_data_files`. But at that point I get the error org.apache.iceberg.exceptions.NotFoundException: Failed to open input stream for file `<filename>` :x:
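For reference, the steps above can be sketched in Spark SQL. This is a minimal illustration, not the reporter's exact session; the catalog, namespace, and table names (`nessie.db.demo`) are hypothetical:

```sql
-- 1. Create a new Iceberg table via a Nessie catalog (names are examples)
CREATE TABLE nessie.db.demo (id BIGINT, name STRING) USING iceberg;
INSERT INTO nessie.db.demo VALUES (1, 'a'), (2, 'b');

-- 4. Reading the table data itself still works after the nessie-gc run
SELECT * FROM nessie.db.demo;

-- 5. Querying the all_data_files metadata table fails with
--    org.apache.iceberg.exceptions.NotFoundException
SELECT * FROM nessie.db.demo.all_data_files;
```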

For my tests:

What I expect:

I would expect the gc-tool to take care of all the metadata tables listed in https://iceberg.apache.org/docs/nightly/spark-queries/#all-metadata-tables. I would also expect the number of snapshots from https://iceberg.apache.org/docs/nightly/spark-queries/#snapshots to be reduced, but the number of snapshots is the same as before running nessie-gc.
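The snapshot-count observation above can be checked with a query against the `snapshots` metadata table. Again a sketch, assuming the same hypothetical table name:

```sql
-- Run before and after nessie-gc; the reporter observed the same count both
-- times, even though .avro/.json files had been deleted from the bucket
SELECT COUNT(*) FROM nessie.db.demo.snapshots;
```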

Versions:

dimas-b commented 7 months ago

@KingLommel : Did you try querying actual table data? Does that work? (Edit: I see that this works.)

AFAIK, `<table-name>.all_data_files` is a synthetic table produced by Iceberg on the fly; it does not actually represent the data in the Iceberg table itself. It is unfortunate that this information cannot be retrieved after a Nessie GC. We'll look into this.