Open dorsegal opened 1 week ago
This may be related as well: https://github.com/projectnessie/nessie/issues/8263
@dorsegal can you provide more details or, better, a reproducer?
My setup uses the Kafka Connect Iceberg sink. I create a table and use the sink to write data into it. After several commits, I run GC and then try to rewrite the data files using Spark:
spark.sql( "CALL nessie.system.rewrite_data_files(table => 'table', options => map('partial-progress.enabled','true', 'max-concurrent-file-group-rewrites', '30'))" ).show()
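For anyone trying to reproduce this with different options, the call above can be assembled from a plain dictionary. This is only a sketch: the helper name `build_rewrite_call` is hypothetical, values are not escaped, and it assumes a SparkSession named `spark` with the `nessie` catalog already configured.

```python
def build_rewrite_call(catalog: str, table: str, options: dict[str, str]) -> str:
    """Return a CALL statement for Iceberg's rewrite_data_files procedure."""
    # map() takes alternating keys and values: map('k1', 'v1', 'k2', 'v2')
    pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map({pairs}))"
    )


rewrite_sql = build_rewrite_call(
    "nessie",
    "table",
    {
        "partial-progress.enabled": "true",
        "max-concurrent-file-group-rewrites": "30",
    },
)
print(rewrite_sql)

# With an active SparkSession named `spark` (assumed, not created here):
# spark.sql(rewrite_sql).show()
```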
I can provide more logs if needed; I just don't know which ones would help. From the GC logs, I can see that it deleted some files.
What I meant is a full reproducer that lists every step starting from scratch, so that someone can get the same behavior in a clean, empty environment.
java -jar nessie-gc-0.99.0.jar gc
failed:
Caused by: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE robot_dev.robot_data, content-ID fc122060-bf21-44d3-b776-fbecb2d23715 at commit 47a35e9867b0408c65feb09d7140a29d198354edf3a1aa0dc7cc09d192b07c27 via s3://ice-lake/robot_dev/robot_data/metadata/00000-6b65b94d-6370-4c43-9baa-b40ed0770c5d.metadata.json
After expiring snapshots in Spark SQL:
CALL nessie.system.expire_snapshots('nessie.robot_dev.robot_data', TIMESTAMP '2024-10-15 00:00:00.000', 1)
The snapshot count went down and the manifest files were deleted, but the Nessie metadata may not have picked up the snapshot changes.
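One way to see which snapshots and manifest lists the current table metadata still references is to query Iceberg's `snapshots` metadata table. A minimal sketch, assuming the same `nessie` catalog and a SparkSession named `spark`; the helper name `snapshots_query` is hypothetical:

```python
def snapshots_query(table: str) -> str:
    """Return a query against the Iceberg `snapshots` metadata table of `table`."""
    # `<table>.snapshots` lists the snapshots known to the current table metadata,
    # including the manifest list file each snapshot points to.
    return (
        "SELECT committed_at, snapshot_id, operation, manifest_list "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


snapshot_sql = snapshots_query("nessie.robot_dev.robot_data")
print(snapshot_sql)

# With an active SparkSession named `spark` (assumed, not created here):
# spark.sql(snapshot_sql).show(truncate=False)
```

Note that this only reflects the table state on the reference currently in use; it does not show what older Nessie commits still point to.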
@snazy
What happened
After running GC, I started getting "file does not exist" errors. It looks like the file was deleted but is still referenced in the metadata.
https://github.com/apache/iceberg/issues/8338
How to reproduce it
Nessie server type (docker/uber-jar/built from source) and version
kubernetes 0.99.0
Client type (Ex: UI/Spark/pynessie ...) and version
Spark
Additional information
No response