Closed mvdan closed 4 years ago
I forgot to mention the env vars. We run the following:
go env -w GOPRIVATE='brank.as/*' CGO_ENABLED=0 GOCACHE=/root/openbank-services/.cache/gocache GOBIN=/root/bin GOFLAGS=-mod=readonly
export GOPATH="/root/openbank-services/.cache/gopath"
From the filenames involved in the error, it seems likely that the failing call is this one: https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L122
That seems to imply one of the following possibilities:

- The call to `lockVersion` that guards the `os.Rename` is failing to actually provide mutual exclusion. (Perhaps the filesystem's `flock` implementation erroneously reports success without actually locking the file?)
  https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L76
- The call to `os.Stat` after `lockVersion` is erroneously returning a non-nil error even though the directory exists and is visible to the current user. (Perhaps the filesystem is not providing a consistent ordering between an `flock` on the lockfile and a `rename` on the corresponding directory?)
  https://github.com/golang/go/blob/daacf269802eaa856705063159b55b5e752e7058/src/cmd/go/internal/modfetch/fetch.go#L83
Either way, given the information we have so far, this seems more likely to be a bug in the underlying filesystem than in the `go` command. (Of course, with more information that could change.)
Could you try running `go test cmd/go/internal/lockedfile/... cmd/go/internal/renameio cmd/go/internal/robustio` with a fairly high `-count` argument and a `TMPDIR` on the same filesystem, and see if that turns anything up?
> I'd be surprised if our setup was to blame, because another of our CI pipelines does run many `cmd/go` commands concurrently with shared `$GOPATH` and `$GOCACHE` via the same volume setup.
Note that the concurrency strategy for `GOCACHE` uses idempotent writes rather than file-locking, so an `flock` bug would generally only turn up for operations that download to the module cache.

(We use file-locking in the module cache because idempotent writes would be significantly less efficient in many cases, and because it is otherwise difficult to signal that a downloaded module is complete and ready for use. In contrast, within `GOCACHE` we record lengths and checksums for the individual files to be written, so we can detect an incomplete file by a bad checksum or length.)
> this seems more likely to be a bug in the underlying filesystem than in the `go` command.
This was my initial suspicion, but we're using a pretty recent stable Docker on the most recent Ubuntu LTS, with an ext4 disk. It doesn't get more standard and stable than this, I think.
> Note that the concurrency strategy for `GOCACHE` uses idempotent writes rather than file-locking, so an `flock` bug would generally only turn up for operations that download to the module cache.
That's a good point. Though the other CI builds could do concurrent module fetches if the cache isn't up to date, the build that's causing problems doesn't have any concurrent steps whatsoever, which is why I'm extra confused.
> (Of course, with more information that could change.)
I realised this issue wouldn't have much actionable for you, but I still filed it in case you saw something that I didn't. And in case others would find it useful in the future, if they encounter the same error.
I'll give those `go test` tips a go, for now. In any case, I'm happy to close this after a week as "needs more info" if you're pretty sure the code is correct. I can always reopen if I find anything new.
Ok, wow, this is beyond embarrassing. The CI config was buggy; someone had messed with it while I was away on vacation, and they removed the dependency between the "run `go test`" and "restore the cache" steps.
I did look at that twice, but of course, I'm only human :(
Apologies for the noise and the waste of time. This is definitely a filesystem data race that's entirely our fault.
This happens sporadically on `golang:1.13.5` with `Docker version 19.03.5, build 633a0ea838` and Linux `4.15.0-72-generic #81-Ubuntu`.

It's happened on a CI build job three times in the past week, for a job that runs twice per hour. So, roughly, about 1% of the time. I haven't been able to reliably reproduce the error, nor do we run these jobs with Go tip.
Unfortunately, this is happening with a piece of internal end-to-end testing, so its source and build jobs are not public.
Here is the log, since it doesn't contain any sensitive info:
The `gopath` directory in question is cached between builds. The way we do that is by atomically storing a `tar.zst` archive of the `$HOME/.cache/` directory at the end of a successful build, and extracting it at the start.

It should be noted that this `go test` docker container does not share any volumes with other docker containers, e.g. other concurrent `go test` commands. Because of how this CI system is designed, `$HOME` is a volume, because it needs to persist between build steps. Perhaps this affects how the filesystem works, since `$GOPATH` is under it.

I tried to do some debugging, but failed to find the cause so far. Here is a summary:
- `/root/openbank-services/.cache/gopath/pkg/mod/github.com/gogo/protobuf@v1.2.2-0.20190723190241-65acae22fc9d` exists and looks correct. Though this might be a newer archive.
- I read `src/cmd/go/internal/modfetch/fetch.go`, and the locking and renaming of the directory looks non-racy to me.
- `fetch.go` would error immediately if locking wasn't supported, instead of silently using no locking.
- `err == nil && fi.IsDir()` and then just `os.Rename`. But I guess this scenario would mean that `$GOPATH` got corrupted.

I'd be surprised if our setup was to blame, because another of our CI pipelines does run many `cmd/go` commands concurrently with shared `$GOPATH` and `$GOCACHE` via the same volume setup. We've run thousands of those jobs in the past month alone, and I don't recall a single error like this.

/cc @bcmills @jayconrod