Closed N-o-Z closed 1 month ago
The commit itself exists but is new (see below). Retrying -- even immediately -- appears to help. And this is definitely flakiness, otherwise once it failed it would be hard for the issue to go away - I would expect to meet such an issue a lot more. There are no races in the GC thread itself as it is single-threaded. So a race has to be with another thread of execution (not a "POSIX" thread, just a separate concurrent sequence of steps).
So I'm looking for a race. Seek not, find not to quote some teachings (אם לא תחפש לא תמצא). Indeed, I found a race which seems plausible - so I will fix that race.
Thread 1 | Thread 2 |
---|---|
Start preparing GC | |
Start listing commits | |
...list... | add commit c0ffee |
Start tracking commits from branch starting points | |
Attempt to read commit c0ffee |
The new commit c0ffee
might not be present in the list of commits. If it isn't, we will see the exact failure described.
I looked at a scenario where this issue actually occurred. The failing commit was added ~83 seconds before it was looked up. This does not prove that this race is responsible, but it certainly supports the claim.
I will fix the lookup: whenever prepare fails to find a commit, load it from KV into commitsMap. This will definitely prevent this error from occurring (we perform consistent reads on KV).
A worry in production is that we no longer fail on finding a commit. If we end up looking for many commits via KV GET rather than LIST, then things will break down. This appears unlikely. I will add (only) a log for such a miss. Users can monitor for such a log or look for it after the fact. (Includes us in Cloud of course).
Spark error:
Digging into the code - the failure came from
active_commits.go
GetGarbageCollectionCommits
line 58:Seems like the commit was not found in the commitsMap. The retryer in the Scala API client is masking an error that might indicate a potential bug in lakeFS as we see this flow being retried several times until success or until retry exhaustion. This might be an indication for a bug in the
GCStartingPointIterator
or the logic which populatescommitsMap