Closed mvdan closed 4 years ago
I forgot to mention the env vars. We run the following:
go env -w GOPRIVATE='brank.as/*' CGO_ENABLED=0 GOCACHE=/root/openbank-services/.cache/gocache GOBIN=/root/bin GOFLAGS=-mod=readonly
export GOPATH="/root/openbank-services/.cache/gopath"
From the filenames involved in the error, it seems likely that the failing call is this one: https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L122
That seems to imply one of the following possibilities:

- The call to `lockVersion` that guards the `os.Rename` is failing to actually provide mutual exclusion. (Perhaps the filesystem's `flock` implementation erroneously reports success without actually locking the file?)
  https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L76
- The call to `os.Stat` after `lockVersion` is erroneously returning a non-nil error even though the directory exists and is visible to the current user. (Perhaps the filesystem is not providing a consistent ordering between an `flock` on the lockfile and a `rename` on the corresponding directory?)
  https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L83
Either way, given the information we have so far, this seems more likely to be a bug in the underlying filesystem than in the `go` command. (Of course, with more information that could change.)
Could you try running `go test cmd/go/internal/lockedfile/... cmd/go/internal/renameio cmd/go/internal/robustio` with a fairly high `-count` argument and a `TMPDIR` on the same filesystem, and see if that turns anything up?
> I'd be surprised if our setup was to blame, because another of our CI pipelines does run many `cmd/go` commands concurrently with shared `$GOPATH` and `$GOCACHE` via the same volume setup.
Note that the concurrency strategy for `GOCACHE` uses idempotent writes rather than file-locking, so an `flock` bug would generally only turn up for operations that download to the module cache.

(We use file-locking in the module cache because idempotent writes would be significantly less efficient in many cases, and because it is otherwise difficult to signal that a downloaded module is complete and ready for use. In contrast, within `GOCACHE` we record lengths and checksums for the individual files to be written, so we can detect an incomplete file by a bad checksum or length.)
> this seems more likely to be a bug in the underlying filesystem than in the `go` command.
This was my initial suspicion, but we're using a pretty recent stable Docker on the most recent Ubuntu LTS, with an ext4 disk. It doesn't get more standard and stable than this, I think.
> Note that the concurrency strategy for `GOCACHE` uses idempotent writes rather than file-locking, so an `flock` bug would generally only turn up for operations that download to the module cache.
That's a good point. Though the other CI builds could do concurrent module fetches if the cache isn't up to date, the build that's causing problems doesn't have any concurrent steps whatsoever, which is why I'm extra confused.
> (Of course, with more information that could change.)
I realised this issue wouldn't have much actionable for you, but I still filed it in case you saw something that I didn't. And in case others would find it useful in the future, if they encounter the same error.
I'll give those `go test` tips a go, for now. In any case, I'm happy to close this after a week as "needs more info" if you're pretty sure the code is correct. I can always reopen if I find anything new.
Ok, wow, this is beyond embarrassing. The CI config was buggy; someone had messed with it while I was away on vacation, and they removed the dependency between the "run `go test`" and "restore the cache" steps.
I did look at that twice, but of course, I'm only human :(
Apologies for the noise and the waste of time. This is definitely a filesystem data race that's entirely our fault.
This happens sporadically on `golang:1.13.5` with `Docker version 19.03.5, build 633a0ea838` and Linux `4.15.0-72-generic #81-Ubuntu`.

It's happened on a CI build job three times in the past week, for a job that runs twice per hour. So, roughly, about 1% of the time. I haven't been able to reliably reproduce the error, nor do we run these jobs with Go tip.
Unfortunately, this is happening with a piece of internal end-to-end testing, so its source and build jobs are not public.
Here is the log, since it doesn't contain any sensitive info:
The `gopath` directory in question is cached between builds. The way we do that is by atomically storing a `tar.zst` archive of the `$HOME/.cache/` directory at the end of a successful build, and extracting it at the start.

It should be noted that this `go test` docker container does not share any volumes with other docker containers, e.g. other concurrent `go test` commands. Because of how this CI system is designed, `$HOME` is a volume, because it needs to persist between build steps. Perhaps this affects how the filesystem works, since `$GOPATH` is under it.

I tried to do some debugging, but failed to find the cause so far. Here is a summary:
- `/root/openbank-services/.cache/gopath/pkg/mod/github.com/gogo/protobuf@v1.2.2-0.20190723190241-65acae22fc9d` exists and looks correct. Though this might be a newer archive.
- I read `src/cmd/go/internal/modfetch/fetch.go`, and the locking and renaming of the directory looks non-racy to me.
- `fetch.go` would error immediately if locking wasn't supported, instead of silently using no locking.
- `err == nil && fi.IsDir()` and then just `os.Rename`. But I guess this scenario would mean that `$GOPATH` got corrupted.

I'd be surprised if our setup was to blame, because another of our CI pipelines does run many `cmd/go` commands concurrently with shared `$GOPATH` and `$GOCACHE` via the same volume setup. We've run thousands of those jobs in the past month alone, and I don't recall a single error like this.

/cc @bcmills @jayconrod