treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0
4.46k stars 359 forks source link

BUG: GC: GetGarbageCollectionCommits sometimes fails on missing starting point commit #8261

Closed N-o-Z closed 1 month ago

N-o-Z commented 1 month ago

Spark error:

io.lakefs.clients.sdk.ApiException: Message: Internal Server Error
HTTP response code: 500
HTTP response body: {"message":"save garbage collection commits: find expired commits: commit not found: 

Digging into the code - the failure came from active_commits.go GetGarbageCollectionCommits line 58:

    for startingPointIterator.Next() {
        startingPoint := startingPointIterator.Value()
        retentionDays := int(rules.DefaultRetentionDays)
        commitNode, ok := commitsMap[startingPoint.CommitID]
        if !ok {
            return nil, fmt.Errorf("%w: %s", ErrCommitNotFound, startingPoint.CommitID)
        }

Seems like the commit was not found in the commitsMap. The retryer in the Scala API client is masking an error that might indicate a potential bug in lakeFS as we see this flow being retried several times until success or until retry exhaustion. This might be an indication for a bug in the GCStartingPointIterator or the logic which populates commitsMap

arielshaqed commented 1 month ago

Flaky

The commit itself exists but is new (see below). Retrying -- even immediately -- appears to help. And this is definitely flakiness, otherwise once it failed it would be hard for the issue to go away - I would expect to meet such an issue a lot more. There are no races in the GC thread itself as it is single-threaded. So a race has to be with another thread of execution (not a "POSIX" thread, just a separate concurrent sequence of steps).

Diagnosis

So I'm looking for a race. Seek not, find not to quote some teachings (אם לא תחפש לא תמצא). Indeed, I found a race which seems plausible - so I will fix that race.

Thread 1Thread 2
Start preparing GC
Start listing commits
...list...add commit c0ffee
Start tracking commits from branch starting points
Attempt to read commit c0ffee

The new commit c0ffee might not be present in the list of commits. If it isn't, we will see the exact failure described.

I looked at a scenario where this issue actually occurred. The failing commit was added ~83 seconds before it was looked up. This does not prove that this race is responsible, but it certainly supports the claim.

Proposed fix

I will fix the lookup: whenever prepare fails to find a commit, load it from KV into commitsMap. This will definitely prevent this error from occurring (we perform consistent reads on KV).

A worry in production is that we no longer fail on finding a commit. If we end up looking for many commits via KV GET rather than LIST, then things will break down. This appears unlikely. I will add (only) a log for such a miss. Users can monitor for such a log or look for it after the fact. (Includes us in Cloud of course).