projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

[Bug]: GC causing "org.apache.iceberg.exceptions.NotFoundException: File does not exist" #9749

Open dorsegal opened 1 week ago

dorsegal commented 1 week ago

What happened

After running GC, I started getting "File does not exist" errors. It looks like the file was deleted from storage but is still referenced by the table metadata.

https://github.com/apache/iceberg/issues/8338

How to reproduce it

  1. Create table
  2. Add data
  3. Run GC with GC command
  4. Try to read all data from table again
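The four steps above could be written down as concrete commands roughly like this. This is only a sketch under assumptions: the namespace `demo`, the table `events`, and its columns are hypothetical, the catalog name `nessie` and the jar version 0.99.0 are taken from this report, and the connection options that the GC tool needs are omitted.

```python
from typing import List, Tuple

# Hypothetical table name: catalog "nessie" as used in this report,
# namespace "demo" and table "events" are made up for illustration.
TABLE = "nessie.demo.events"


def reproduction_steps(table: str = TABLE) -> List[Tuple[str, str]]:
    """Return (kind, command) pairs for the four reproduction steps, in order."""
    return [
        # 1. Create table (the columns are an assumption)
        ("spark-sql", f"CREATE TABLE {table} (id BIGINT, payload STRING) USING iceberg"),
        # 2. Add data
        ("spark-sql", f"INSERT INTO {table} VALUES (1, 'a'), (2, 'b')"),
        # 3. Run GC (connection options omitted)
        ("shell", "java -jar nessie-gc-0.99.0.jar gc"),
        # 4. Try to read all data from the table again
        ("spark-sql", f"SELECT * FROM {table}"),
    ]


if __name__ == "__main__":
    for kind, command in reproduction_steps():
        print(f"[{kind}] {command}")
```

The function only builds the command strings; running them still requires a Spark session configured for the Nessie catalog and a working GC setup.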

Nessie server type (docker/uber-jar/built from source) and version

kubernetes 0.99.0

Client type (Ex: UI/Spark/pynessie ...) and version

Spark

Additional information

No response

dorsegal commented 1 week ago

This may be related as well: https://github.com/projectnessie/nessie/issues/8263

snazy commented 1 week ago

@dorsegal can you provide more details, or better a reproducer?

dorsegal commented 1 week ago

My setup uses the Kafka Connect Iceberg sink. Create a table and use the sink to write data into it. After several commits, run GC and then try to rewrite the data files using Spark:

spark.sql( "CALL nessie.system.rewrite_data_files(table => 'table', options => map('partial-progress.enabled','true', 'max-concurrent-file-group-rewrites', '30'))" ).show()

I can provide more logs if needed; I just don't know which ones would help. From the GC logs I can see that it deleted some files.

snazy commented 1 week ago

What I meant is a full reproducer that lists every step starting from scratch, so that someone can get the same behavior in a "clean"/empty environment.

yunlou11 commented 1 week ago

java -jar nessie-gc-0.99.0.jar gc

failed:

Caused by: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE robot_dev.robot_data, content-ID fc122060-bf21-44d3-b776-fbecb2d23715 at commit 47a35e9867b0408c65feb09d7140a29d198354edf3a1aa0dc7cc09d192b07c27 via s3://ice-lake/robot_dev/robot_data/metadata/00000-6b65b94d-6370-4c43-9baa-b40ed0770c5d.metadata.json
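To collect the identifiers from an error like this for further inspection, a small parser can help. This is a minimal sketch: the regular expression is an assumption derived only from the single message above, not from the GC tool's source, and the names `GC_ERROR` and `parse_gc_error` are made up here.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the one error message quoted above (an assumption,
# other GC versions may word the message differently).
GC_ERROR = re.compile(
    r"Failed to get manifest files for (?P<content_type>\S+) (?P<table>\S+), "
    r"content-ID (?P<content_id>[0-9a-f-]+) "
    r"at commit (?P<commit>[0-9a-f]+) "
    r"via (?P<metadata_location>\S+)"
)


def parse_gc_error(message: str) -> Optional[Dict[str, str]]:
    """Extract table, content ID, commit and metadata location, or None if no match."""
    match = GC_ERROR.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Caused by: java.lang.RuntimeException: Failed to get manifest files for "
        "ICEBERG_TABLE robot_dev.robot_data, content-ID "
        "fc122060-bf21-44d3-b776-fbecb2d23715 at commit "
        "47a35e9867b0408c65feb09d7140a29d198354edf3a1aa0dc7cc09d192b07c27 via "
        "s3://ice-lake/robot_dev/robot_data/metadata/"
        "00000-6b65b94d-6370-4c43-9baa-b40ed0770c5d.metadata.json"
    )
    for key, value in (parse_gc_error(sample) or {}).items():
        print(f"{key}: {value}")
```

The extracted commit hash and metadata location are the two values needed to check whether that metadata file still exists in storage.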

After expiring snapshots in Spark SQL:

CALL nessie.system.expire_snapshots('nessie.robot_dev.robot_data', TIMESTAMP '2024-10-15 00:00:00.000', 1)

The number of snapshots was reduced and the manifest files were deleted, but the Nessie metadata may not have picked up the snapshot changes.

@snazy