golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.62k stars 17.61k forks source link

cmd/go: get of module already present in proxy can be slow #50158

Open myitcv opened 2 years ago

myitcv commented 2 years ago

What version of Go are you using (go version)?

$ go version
go version devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000 linux/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/tmp.1s97wmUiwJ/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build99146596=/tmp/go-build -gno-record-gcc-switches"

What did you do?

$ go get -x cuelang.org/go@v0.4.1-beta.4
# get https://proxy.golang.org/cuelang.org/@v/v0.4.1-beta.4.info
go: downloading cuelang.org/go v0.4.1-beta.4
# get https://proxy.golang.org/cuelang.org/go/@v/v0.4.1-beta.4.zip
# get https://proxy.golang.org/cuelang.org/go/@v/v0.4.1-beta.4.zip: 200 OK (0.202s)
# get https://proxy.golang.org/sumdb/sum.golang.org/supported
# get https://proxy.golang.org/sumdb/sum.golang.org/supported: 410 Gone (0.009s)
# get https://sum.golang.org/lookup/cuelang.org/go@v0.4.1-beta.4
# get https://sum.golang.org/lookup/cuelang.org/go@v0.4.1-beta.4: 200 OK (0.073s)
# get https://sum.golang.org/tile/8/0/x032/326
# get https://sum.golang.org/tile/8/2/000.p/127
# get https://sum.golang.org/tile/8/1/127.p/228
# get https://sum.golang.org/tile/8/1/126
# get https://sum.golang.org/tile/8/0/x032/740.p/38
# get https://sum.golang.org/tile/8/2/000.p/127: 200 OK (0.021s)
# get https://sum.golang.org/tile/8/0/x032/326: 200 OK (0.022s)
# get https://sum.golang.org/tile/8/1/126: 200 OK (0.023s)
# get https://sum.golang.org/tile/8/1/127.p/228: 200 OK (0.025s)
# get https://sum.golang.org/tile/8/0/x032/740.p/38: 200 OK (0.029s)
# get https://sum.golang.org/tile/8/0/x032/033
# get https://sum.golang.org/tile/8/0/x032/033: 200 OK (0.039s)
# get https://proxy.golang.org/cuelang.org/@v/v0.4.1-beta.4.info: 410 Gone (2.429s)
# get https://proxy.golang.org/cuelang.org/go/@v/v0.4.1-beta.4.mod
# get https://proxy.golang.org/cuelang.org/go/@v/v0.4.1-beta.4.mod: 200 OK (0.027s)
**********************
# get https://proxy.golang.org/cuelang.org/go/@v/list
# get https://proxy.golang.org/cuelang.org/go/@v/list: 200 OK (0.042s)
go: added cuelang.org/go v0.4.1-beta.4
go: added golang.org/x/exp v0.0.0-20210126221216-84987778548c
go: added golang.org/x/mod v0.3.1-0.20200828183125-ce943fd02449
go: added golang.org/x/tools v0.0.0-20200612220849-54c614fe050c
go: added golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543

The command stalled at the line ********************** for a good 15-20 secs.

What did you expect to see?

What did you see instead?

As abve.

Noting a subsequent call had output as follows:

$ go get -x cuelang.org/go@v0.4.1-beta.4
# get https://proxy.golang.org/cuelang.org/@v/v0.4.1-beta.4.info
# get https://proxy.golang.org/cuelang.org/@v/v0.4.1-beta.4.info: 410 Gone (2.159s)
# get https://proxy.golang.org/cuelang.org/go/@v/list
# get https://proxy.golang.org/cuelang.org/go/@v/list: 200 OK (0.045s)

Here the go get command ultimately took as long as it did (~2.5 secs) because of the long response time to:

# get https://proxy.golang.org/cuelang.org/@v/v0.4.1-beta.4.info: 410 Gone (2.159s)

Given go get already by this point got a successful response from:

# get https://proxy.golang.org/cuelang.org/go/@v/list: 200 OK (0.045s)

could cmd/go not optimise to simply "drop" the other requests, given the most specific one returned successfully?

cc @bcmills for cmd/go and @heschi in case there is any proxy-related issues here

bcmills commented 2 years ago

Yes, in theory we could return as soon as we find the package at the most precise path. That short-circuit isn't optimized because the shorter paths are supposed to return quickly — so the question is, why did the proxy take 2+ seconds to tell you about a version that trivially doesn't exist?

myitcv commented 2 years ago

so the question is, why did the proxy take 2+ seconds to tell you about a version that trivially doesn't exist?

Indeed. Plus the question above about the 15-20 secs delay we can't account for in the -x output.

heschi commented 2 years ago

Checking if a version exists or not takes some measurable amount of time, and the proxy's implementation creates some overhead. I'm not sure what "trivially doesn't exist" means, but if we have to run a go command and it has to talk to an origin server, 2 seconds does not seem unreasonable to me.

myitcv commented 2 years ago

@heschi

if we have to run a go command and it has to talk to an origin server, 2 seconds does not seem unreasonable to me.

Why would a go command call be necessary here? Because the proxy can't be sure that it's previously cached answer of "nothing there" is still valid or not?

bcmills commented 2 years ago

FWIW, the go command invocation in this case completes in ~100ms:

$ GOPROXY=direct time go list -m cuelang.org@v0.4.1-beta.4
go: cuelang.org@v0.4.1-beta.4: unrecognized import path "cuelang.org": parsing cuelang.org: XML syntax error on line 44: expected /> in element
Command exited with non-zero status 1
real    0:00.10
user    0.02
sys     0.01
rss     20944k

So that seems like 2+ seconds for container setup? Which I guess seems ok, but given the previous trace I would have expected the negative result to already be present in the cache.

bcmills commented 2 years ago

Indeed. Plus the question above about the 15-20 secs delay we can't account for in the -x output.

Agreed. If that happens again, could you kill the go command with SIGQUIT and post the goroutine dump (or send it to me)?

heschi commented 2 years ago

Because the proxy can't be sure that it's previously cached answer of "nothing there" is still valid or not?

Yes.

real 0:00.10

I don't believe this was done with an empty module cache. For me it takes anywhere from 300ms to 1.5s. But yes, there is quite a bit overhead proportional to the required work for something this small.

The behavior of the proxy is well within my expectations, so I'm going to drop out of the discussion now.

myitcv commented 2 years ago

Thanks, Heschi. Very much appreciate the clarifications.

Because the proxy can't be sure that it's previously cached answer of "nothing there" is still valid or not?

Yes.

@bcmills I think this points towards there being some merit in the short-circuit described earlier?

myitcv commented 2 years ago

@bcmills

could you kill the go command with SIGQUIT and post the goroutine dump (or send it to me)?

https://gist.github.com/myitcv/b7939fcc511a007af00297ceb39847b3

I triggered this about 7 seconds after the last line of output shown (didn't want to wait too much longer in case I missed the opportunity). What's interesting is that in this time the precise path hasn't returned.

What's even more interesting is that these calls shouldn't be necessary because I have this module in my module cache.

bcmills commented 2 years ago

Hmm. There isn't much actually running in that goroutine dump — one in a read syscall via modfetch.(*dbClient).ReadCache, and what appears to be a whole cluster of goroutines waiting on that same tile.

Nothing seems to have been blocked for a long time, so it isn't obvious to me whether we're bottlenecked on reading the module graph one time (or reading checksums one time), or reloading the graph (or reloading the subdb tiles) repeatedly for some reason.

A trace file from --debug-trace=go.trace might be interesting to see, but it might just tell us some functions that need more detailing trace instrumentation. 😅

myitcv commented 2 years ago

@bcmills - an update from my side. Turns out some of the issues I have been seeing are related to local disk errors (thankfully not actual disk errors, just problems with a NVM disk and the Linux kernel I am using).

So I'm going to attribute the unexplained long delays (15-20secs) to those disk errors. This is reasonable and consistent with other behaviour I was able to reliably reproduce.

That leaves two issues to my mind (hence I've retitled the issue):

1) why go get cuelang.org/go@v0.4.1-beta.4 in a new module is hitting the network at all when that exact version is available in my local module cache 2) whether cmd/go get should short-circuit when it gets a response from a more precise module path (per discussion above). Because when the proxy hasn't seen a request for cuelang.org/go "for some time" it ends up hitting the remote VCS to check whether a module exists at the path cuelang.org, and the client's cmd/go instance blocks until that returns. This delay can be measured in seconds, and exacerbates point 1.