projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

[Bug]: GC isn't removing objects from bucket #10008

Closed paul-bormans-pcgw closed 1 day ago

paul-bormans-pcgw commented 4 days ago

What happened

In a simple Compose setup I'm running MinIO, Nessie and nessie-gc. I'm running just the gc mode, so identify and sweep happen in one go.

The logs tell me that files are being removed:

2024-11-28 15:15:19,308 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#1d7385e5-0cd3-4094-962b-6167054ea700: Start walking the commit log of Branch{name=main, metadata=null, hash=be097803b9112d08d4725485f5610ae057ed42c505d2141b31d2d01e6b938106} using cutoff at timestamp 2024-11-28T12:15:18.178699551Z.
2024-11-28 15:15:20,197 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#1d7385e5-0cd3-4094-962b-6167054ea700: Finished walking the commit log of Branch{name=main, metadata=null, hash=be097803b9112d08d4725485f5610ae057ed42c505d2141b31d2d01e6b938106} using cutoff at timestamp 2024-11-28T12:15:18.178699551Z after 225 commits, commit 1547a75561fca3b5bc9830ae3fe1bc55082b699c69c08fe28c97adc52f128184 is the first non-live commit.
2024-11-28 15:15:20,198 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#1d7385e5-0cd3-4094-962b-6167054ea700: Finished walking all named references, took PT1.188581737S: numReferences=1, numCommits=225, numContents=224, shortCircuits=0.
Finished Nessie-GC identify phase finished with status IDENTIFY_SUCCESS after PT1.195172S, live-content-set ID is 1d7385e5-0cd3-4094-962b-6167054ea700.
2024-11-28 15:15:20,284 [main] INFO  o.p.g.e.local.DefaultLocalExpire - live-set#1d7385e5-0cd3-4094-962b-6167054ea700: Starting expiry.
2024-11-28 15:15:20,316 [ForkJoinPool-3-worker-2] INFO  org.apache.iceberg.CatalogUtil - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
2024-11-28 15:18:52,930 [ForkJoinPool-3-worker-2] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#1d7385e5-0cd3-4094-962b-6167054ea700 content#4cbe6f64-203e-486e-9eed-d252c9e19e1f: Found 11435 total files in base location s3://demobucket/ts/pack_96fbe8c4-6ddb-4063-82c2-92ab16937171/, 1606 files considered expired, 9718 files considered live, 111 files are newer than max-file-modification-time.
2024-11-28 15:18:52,931 [main] INFO  o.p.g.e.local.DefaultLocalExpire - live-set#1d7385e5-0cd3-4094-962b-6167054ea700: Expiry finished, took PT3M32.646451679S, deletion summary: DeleteSummary{deleted=1606, failures=0}.
Nessie-GC sweep phase for live-content-set 1d7385e5-0cd3-4094-962b-6167054ea700 finished with status EXPIRY_SUCCESS after PT1.195172S, deleted 1606 files, 0 files could not be deleted.

However, when I simply count the files in the bucket, I see that nothing gets removed.

I have confirmed that many rows of older data are removed by Trino, so the newer snapshot IDs hold far less data. These files should eventually be removed by nessie-gc, right?
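
For reference, this is roughly how I count the objects in the bucket (a minimal sketch assuming boto3 and the MinIO endpoint/credentials from the Compose setup; the prefix is the base location from the log above):

# Count all objects under the table's base location in MinIO.
# Endpoint, credentials, bucket and prefix are taken from the setup/log above.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

count = 0
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="demobucket", Prefix="ts/pack_96fbe8c4-6ddb-4063-82c2-92ab16937171/"
):
    count += page.get("KeyCount", 0)

print(f"objects under prefix: {count}")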

How to reproduce it

  1. nessie-gc with:
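      # Note: this is the command entry from the compose file, which is why $ is escaped as $$ and the snippet ends with a quote.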
      while true; do \
        MFM=$$(date -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ');
        echo 'Running gc'; \
        java -Dlog.level.console=INFO -jar /nessie-gc.jar \
        gc \
        --jdbc-url=jdbc:postgresql://postgres:5432/nessie_gc \
        --jdbc-user=postgres \
        --jdbc-password=postgres \
        --uri=http://nessie:19120/api/v2 \
        --iceberg=s3.path-style-access=true,s3.access-key-id=minioadmin,s3.secret-access-key=minioadmin,s3.endpoint=http://minio:9000/ \
        --default-cutoff=PT3H \
        --max-file-modification=$${MFM} \
        ; \
        sleep 60; done"
  2. A script is adding content to Iceberg using pyiceberg, say a commit every second (a sketch of such a script is shown after this list).
  3. Using Trino, old data is removed: "DELETE FROM pack WHERE ... etc."
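
For reference, the writer in step 2 looks roughly like this (a minimal sketch, not the exact script; the REST catalog URI, the S3 properties and the table name ts.pack are assumptions):

# Append a small batch to an Iceberg table once per second via pyiceberg.
# Catalog URI, S3 settings and table name are assumptions, not the exact setup.
import time
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg",  # assumed Iceberg REST endpoint of Nessie
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    },
)

table = catalog.load_table("ts.pack")  # hypothetical namespace.table

while True:
    batch = pa.table({"id": [int(time.time())], "value": ["row"]})
    table.append(batch)  # every append creates a new Iceberg snapshot / Nessie commit
    time.sleep(1)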

Nessie server type (docker/uber-jar/built from source) and version

docker: quay.io/projectnessie/nessie-gc:0.100.2

Client type (Ex: UI/Spark/pynessie ...) and version

No response

Additional information

It would be helpful if nessie-gc's INFO logging said which objects are removed.

snazy commented 1 day ago

@paul-bormans-pcgw this is rather a question, but GitHub issues are for bugs and tasks. Can you please post your question on our Zulip chat?

Our website explains here when files become eligible for deletion.