bhearsum opened this issue 7 months ago
@matt-boris and I dug into this a bunch today. Here's what we found:
The error comes from `run-task`. When a retried run starts, `run-task` validates the cache just like the first run did, and a populated cache is required to have a `.cacherequires` file on it. There are two very confusing parts of this still:

1) Why does one task that's been spot killed end up with this error, while another one runs fine?
2) How exactly we could get into this state. The runs that get spot killed get far enough that they have already written a `.cacherequires` file, which suggests that we should end up in the code path in `run-task` that finds a `.cacherequires` file.
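For reference, the check in `run-task` is shaped roughly like this (a paraphrased sketch of the logic, not the verbatim source):

```python
import os
import sys

def validate_cache(cache_dir, expected_requirements):
    """Rough shape of the validation run-task performs on a mounted cache."""
    entries = os.listdir(cache_dir)
    requires_file = os.path.join(cache_dir, ".cacherequires")

    if not entries:
        # First use of this cache: record what the task requires of it.
        with open(requires_file, "w") as fh:
            fh.write("\n".join(expected_requirements) + "\n")
        return

    if ".cacherequires" not in entries:
        # The state we keep hitting: a populated cache (it has a checkout
        # in it) but no .cacherequires marker, so run-task bails out.
        print("cache %s is not empty and is missing a .cacherequires file"
              % cache_dir)
        sys.exit(1)

    with open(requires_file) as fh:
        if fh.read().splitlines() != list(expected_requirements):
            print("cache requirements for %s have changed" % cache_dir)
            sys.exit(1)
```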
I think the next step here is to add a bunch of additional logging to `run-task` to help inspect the state of the cache when we hit a failure like this. It _should_ be fairly easy to achieve this by simulating host maintenance events on workers.
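The way I'd expect to simulate that is GCP's maintenance-event simulation; a rough sketch, assuming the `gcloud compute instances simulate-maintenance-event` command and with placeholder instance/zone values:

```python
import subprocess

def simulate_maintenance(instance, zone):
    """Ask GCP to simulate a host maintenance event on a worker instance;
    on spot/preemptible workers this should look like the terminations we
    see in production."""
    subprocess.run(
        ["gcloud", "compute", "instances", "simulate-maintenance-event",
         instance, "--zone", zone],
        check=True,
    )

# Hypothetical usage:
# simulate_maintenance("<worker instance name>", "<zone>")
```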
I managed to reproduce this issue with additional debugging information in this task. In it, we have one cache configured:
```yaml
cache:
  translations-level-1-checkouts-v3-62df7d36d8ce4e58ade4-bEwo47EJT7-85VhUssFA3Q: /builds/worker/checkouts
```
My additional debugging consisted of dumping all of the files in the cache directory, as well as the `.cacherequires` and `.cachelog` files. In the run that hit this issue, we do in fact see that `/builds/worker/checkouts` has a checkout, but no `.cacherequires` file.
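The dump itself was essentially along these lines (a sketch, not the exact patch):

```python
import os

def dump_cache_state(cache_dir):
    """Debug helper: list everything in the cache and print the marker files."""
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            print(os.path.join(root, name))

    for marker in (".cacherequires", ".cachelog"):
        path = os.path.join(cache_dir, marker)
        if os.path.exists(path):
            with open(path) as fh:
                print("%s contents:\n%s" % (marker, fh.read()))
        else:
            print("%s is MISSING" % marker)
```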
Interestingly (and annoyingly), the immediately previous run was in a very odd state: it was claimed by a worker that didn't seem to exist in GCP (at least, not under the worker id that was given). It's not clear to me whether or not this is related to the issue, but it's coincidental enough to be worth noting here. (The previous time we saw this, the errors came after a well-handled spot termination.)
It does seem like there's a legitimate bug in `run-task` somewhere, but this is very difficult to debug given the circumstances required to reproduce it :(.
Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache.
A few guesses:

* Could multiple generic-worker instances be running in parallel, leading to `run-task` racing against itself?
* Are these spot-terminated instances "rebooting" somehow, and coming back up with partially populated caches?
* I notice `run-task`'s comments say this "shouldn't happen" based on an argument about the order things were done in the filesystem. But that ordering isn't always consistent in the face of mountings/unmountings, reboots, etc. - is it possible there's some race condition there? Maybe an `fsync` would help?

Thanks Dustin; you were the closest thing we had to an expert based on the blame :grimacing:
> Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache.
In this case, we have a mounted cache volume, which (I believe) is used across multiple workers, which could explain this part?
> A few guesses:
>
> * Could multiple generic-worker instances be running in parallel, leading to `run-task` racing against itself?
> * Are these spot-terminated instances "rebooting" somehow, and coming back up with partially populated caches?
> * I notice `run-task`'s comments say this "shouldn't happen" based on an argument about the order things were done in the filesystem. But that ordering isn't always consistent in the face of mountings/unmountings, reboots, etc. - is it possible there's some race condition there? Maybe an `fsync` would help?
I'll check into these theories; the last one in particular is interesting. We already do things in `run-task` to ensure that, e.g., stdout/stderr are flushed, so it wouldn't surprise me if we need this sort of explicit syncing as well.
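Concretely, I'm imagining something like this when the marker file is written (a sketch; whether it actually helps depends on where the data is being lost):

```python
import os

def write_cache_requires(path, requirements):
    """Write .cacherequires and force it to disk before the task proceeds."""
    with open(path, "w") as fh:
        fh.write("\n".join(sorted(requirements)) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive an abrupt spot termination

    # Also sync the containing directory so the new directory entry itself
    # is durable, not just the file's contents.
    dir_fd = os.open(os.path.dirname(path), os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```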
I apologize, I only barely remember this! But, my reading of the `run-task` script linked above is that it is not downloading anything. And the workers have no "network filesystem" of any sort that would let them share caches. Instead, it's designed to handle caches shared between tasks on the same worker, based on the `cache` key in the payload. IIRC gps wrote this functionality (and especially the `.cacherequires` bit) to avoid problems with some change to the task definition failing when run against old caches. I think caches also change names when their requirements change, so this should "never happen". Judging by the comments, when the requirements do not match, the choice is to fail quickly.
So, I think the place to look when this occurs is the previous run on the same worker.
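From memory, the naming works something like the sketch below, with the requirements hashed into the cache name (an illustration, not the actual taskgraph code; the example values are hypothetical):

```python
import hashlib

def cache_name(base_name, requirements):
    """Derive a cache name that changes whenever the requirements change,
    so an old, incompatible cache simply stops being reused."""
    digest = hashlib.sha256(
        "\n".join(sorted(requirements)).encode("utf-8")
    ).hexdigest()
    return "%s-%s" % (base_name, digest[:20])

# Hypothetical example:
# cache_name("translations-level-1-checkouts-v3", ["run-task", "hg-store"])
# -> "translations-level-1-checkouts-v3-<hash prefix>"
```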
Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things, and it would seem sensible for that to start with `.cacherequires` and `.cachelog` and then start diving into the nested directory structure of the checkout. So, if for some reason that processing is running on a cache at the same time that it is in use by a task, it might explain what you're seeing.
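To illustrate the failure mode I'm imagining, in pseudo-code (not the actual worker implementation, which is in Go):

```python
import os
import shutil

def cleanup_old_cache(cache_dir):
    """Hypothetical cleanup that removes the small marker files first,
    then the (potentially huge) checkout tree."""
    for marker in (".cacherequires", ".cachelog"):
        path = os.path.join(cache_dir, marker)
        if os.path.exists(path):
            os.remove(path)

    # If the worker dies here, or a task is concurrently reusing the cache,
    # we're left with exactly the observed state: a populated cache
    # directory with no .cacherequires file.
    for entry in os.listdir(cache_dir):
        full = os.path.join(cache_dir, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)
        else:
            os.remove(full)
```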
> I apologize, I only barely remember this!

Sorry, I only saw after pinging you that it was... a decade ago that you reviewed it :grimacing:
> But, my reading of the `run-task` script linked above is that it is not downloading anything. And the workers have no "network filesystem" of any sort that would let them share caches.

Right, right, thank you for pointing this out! I kept reading the cache definitions as mount definitions, but a re-read makes clear that this is not the case.
> Instead, it's designed to handle caches shared between tasks on the same worker, based on the `cache` key in the payload. IIRC gps wrote this functionality (and especially the `.cacherequires` bit) to avoid problems with some change to the task definition failing when run against old caches. I think caches also change names when their requirements change, so this should "never happen". Judging by the comments, when the requirements do not match, the choice is to fail quickly.
>
> So, I think the place to look when this occurs is the previous run on the same worker.
And indeed, that's right where we find this in my most recent tests:
Curiously though, the original task this was reported in doesn't seem to have this correlation. We no longer have history for those workers though, unfortunately :(.
> Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things, and it would seem sensible for that to start with `.cacherequires` and `.cachelog` and then start diving into the nested directory structure of the checkout. So, if for some reason that processing is running on a cache at the same time that it is in use by a task, it might explain what you're seeing.
That does sound very plausible, indeed! Is `garbageCollection` what you're referring to here?
One thing I realized while looking at things just now is that the reason we don't hit this on the non-d2g tasks is that none of them have caches configured. The d2g ones have caches configured, ostensibly for the docker images, but they end up getting used for the VCS checkout as well AFAICT.
That sounds like a promising lead! I don't know the relevant GW code, but `garbageCollection` looks like what I was thinking of.
I had another quick look at this today. I think it's unlikely that garbage collection is a factor here, as we've been seeing this on instances where the task directory is located on a 300GB disk, and the earlier tasks certainly don't fill it up enough to trigger GC. There's also no evidence in the log that GC runs prior to the cache error being detected.
In the absence of other ideas, I'm tempted to say that the best path forward here may be to see if we can get rid of the checkout cache for the d2g tasks. This repository is small enough that it really doesn't give us much benefit.
I got inspired and ended up setting up my own worker with some extra instrumentation. I'm pretty darn sure this is a generic-worker bug at this point; I've filed https://github.com/taskcluster/taskcluster/issues/7128 about it.
As an update here, we've landed a couple of fixes upstream already. We're waiting on one edge case to be addressed, after which we'll be unblocked on picking up a new generic-worker version that should be free of this issue.
With some help from the Taskcluster team, I did some testing with the latest version of generic-worker, and it appears to fix this issue. We'll need to upgrade the version we're using on our images, and then we should be able to call this fixed, and switch to d2g/generic-worker for CPU tasks. (Switching GPU tasks to d2g/generic-worker is https://github.com/mozilla/translations/issues/710.)
CPU workers are switching to the new image in https://github.com/mozilla/translations/pull/949. We still have a bit more work to do on the image before it's ready for the GPU workers.
For example: https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg
We had an initial run of this task which got spot killed. Subsequent runs failed with: