josharian opened this issue 4 years ago
Interesting. Those backtraces are normal and just show a thread that has no goroutines to run. The next time this happens, can you get backtraces from the other threads?
Also, `go test` builds a test binary and runs it. Is it the test binary that is hanging, or the cmd/go binary? If the test binary, can you repeat it using `go test -c` and running the test binary directly? Thanks.
I haven’t been able to reproduce since. I’ll keep trying.
This happened again. Some new information:
When this happens, it appears to be a systemwide bad state that affects all invocations of cmd/go, including different Go versions. (I tried 1.14.2 and tip.) I reproduced hangs with `go get`, `go test`, and `go test -c`. Given that the hang is at the end of executing the tests, this all suggests to me some sort of lock or contention writing to a shared resource, probably the build cache.
This does eventually resolve itself, if you wait long enough. Here's the end of a run of `go test -v strings`:
```
=== RUN ExampleBuilder
--- PASS: ExampleBuilder (0.00s)
PASS
ok strings 1.917s
go test -v strings 10.67s user 48.31s system 37% cpu 2:38.10 total 83360 maxrss
```
The last line is from zsh, via REPORTTIME, with TIMEFMT set to `'%J %U user %S system %P cpu %*E total %M maxrss'`.
Observe that the test took ~2s to run, but according to the shell the process was alive well over 2 minutes.
I ran `bt all` on a hung `go test -c strings` process; results attached. Something is weird about thread #9's backtrace.
Speculation: Could this just be cmd/go deleting a very large number of expired cache entries?
If so, maybe we need to bound the number of entries deleted during one auto-initiated run, and have cleanup happen incrementally.
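Roughly what I have in mind, as a minimal sketch (`maxTrimPerRun` and `trimOldEntries` are made-up names, not actual cmd/go code):

```go
package cache // illustrative sketch only

import (
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// maxTrimPerRun bounds how many entries one auto-initiated trim may delete.
const maxTrimPerRun = 10000

// trimOldEntries removes cache files whose mtime is before cutoff, stopping
// after maxTrimPerRun deletions so a single command never pays the full cost.
func trimOldEntries(dir string, cutoff time.Time) error {
	deleted := 0
	return filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if info, statErr := d.Info(); statErr == nil && info.ModTime().Before(cutoff) {
			os.Remove(path)
			if deleted++; deleted >= maxTrimPerRun {
				return fs.SkipAll // bound reached; the rest waits for a later run
			}
		}
		return nil
	})
}
```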
cc @bcmills @jayconrod
> Could this just be cmd/go deleting a very large number of expired cache entries?
Based on the stack and other comments, this seems likely.
If I remember correctly, the build cache expiration policy is purely time-based. There are some comments in src/cmd/go/internal/cache/cache.go. The `go` command will attempt to trim the cache at most once per day. It will remove files that haven't been accessed in 5 days.
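In sketch form, that policy amounts to two durations and a last-trim marker (simplified, with illustrative names; see cache.go for the real implementation):

```go
package cache // simplified sketch of the trim policy described above

import (
	"os"
	"path/filepath"
	"strconv"
	"time"
)

const (
	trimInterval = 24 * time.Hour     // trim at most once per day
	trimLimit    = 5 * 24 * time.Hour // remove entries unused for 5 days
)

// maybeTrim trims only if the last trim was more than trimInterval ago,
// tracking the last trim time in a marker file inside the cache directory.
func maybeTrim(dir string, now time.Time) {
	marker := filepath.Join(dir, "trim.txt")
	if data, err := os.ReadFile(marker); err == nil {
		if sec, err := strconv.ParseInt(string(data), 10, 64); err == nil &&
			now.Sub(time.Unix(sec, 0)) < trimInterval {
			return // trimmed recently enough; skip
		}
	}
	// Delete every entry whose mtime is before now-trimLimit, then record
	// that a trim completed:
	// trimAllEntries(dir, now.Add(-trimLimit))
	os.WriteFile(marker, []byte(strconv.FormatInt(now.Unix(), 10)), 0666)
}
```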
I don't believe there are any limits on the number or total size of files in the cache. It may not make sense to limit the size of the cache, but we might want to trim files more aggressively when it's very large (#29561).
Trimming incrementally also seems like a good solution. We could limit the time spent trimming for any command to, say, 500 ms. We could save where we left off so the next command could restart from that point.
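Something along these lines (all names are illustrative; GOCACHE shards entries into two-hex-digit subdirectories):

```go
package cache // sketch: time-bounded trimming that resumes where it left off

import (
	"os"
	"path/filepath"
	"time"
)

// trimSome trims cache shards for at most budget, starting at the shard
// index where the previous run stopped, and returns the index to persist
// for the next run.
func trimSome(dir string, cutoff time.Time, start int, budget time.Duration) (next int) {
	deadline := time.Now().Add(budget)
	shards, _ := filepath.Glob(filepath.Join(dir, "??")) // 00..ff subdirectories
	if len(shards) == 0 {
		return 0
	}
	for i := range shards {
		idx := (start + i) % len(shards)
		trimShard(shards[idx], cutoff)
		if time.Now().After(deadline) {
			return (idx + 1) % len(shards) // budget spent; resume here next time
		}
	}
	return 0 // completed a full pass
}

// trimShard stats each entry in one shard and removes the expired ones.
func trimShard(subdir string, cutoff time.Time) {
	entries, _ := os.ReadDir(subdir)
	for _, e := range entries {
		if info, err := e.Info(); err == nil && info.ModTime().Before(cutoff) {
			os.Remove(filepath.Join(subdir, e.Name()))
		}
	}
}
```

The returned index would be persisted next to the existing trim marker so the next command can pick up where this one stopped.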
> We could limit the time spent trimming for any command to, say, 500 ms.
I don't think we should limit the trimming time to any fixed duration, because that could lead to a state where the trimming can't keep pace with the rate of creation.
That said, I would expect deleting a file to almost always be faster than creating it. Perhaps we could limit the cleaning time to some multiple of the running time of the `go` command so far (with a lower bound to ensure that we can always do some cleaning).
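For instance (a sketch; the 1/4 multiplier and the 100 ms floor are made-up values):

```go
package cache // sketch: trim budget proportional to the command's own runtime

import "time"

// trimBudget allows cleaning time proportional to how long the command has
// already run, with a floor so some cleaning always happens.
func trimBudget(start time.Time) time.Duration {
	budget := time.Since(start) / 4 // deleting should be cheaper than creating
	if budget < 100*time.Millisecond {
		budget = 100 * time.Millisecond // lower bound: always clean a little
	}
	return budget
}
```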
Another option might be to kick off cleaning the cache at the beginning of the cmd/go run, and let it run concurrently with the rest of the command, except when running benchmarks.
I like that! But I worry it could be a bit racy — is it possible for part of the command to touch part of the cache and make it non-stale again?
It’s not that terrible to have to recalculate something that you just deleted—that would have happened anyway with slightly different timing of when you ran the commands.
I think the only problem is if you see something is in the cache, decide to use it, and then it gets cleaned up before you can use it. But if you can freshen it in the cache, and if the deletion goroutine always rechecks freshness before deletion, and if there are enough synchronization points, it should be ok. But that is a lot of ifs. :)
Yes, that's exactly the race I'm worried about. 🙂
I guess we could decide which portions of the cache to use, then freshen them, then kick off the cleaner while we do the remaining work. (But we intentionally don't rely on file-locking in the cache, and even if we did there is no atomic lock-check-and-delete operation, so there is in general no way to prevent the cleaner from racing with the consumer. So I don't think we can avoid the need to at least finish freshening the entries before we start the cleaner.)
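In sketch form, the ordering I mean is: freshen everything the command will use, then start the cleaner concurrently. Every name here (`freshen`, `clean`, `runCommand`) is hypothetical:

```go
package cache // sketch: freshen needed entries before starting a concurrent cleaner

import (
	"os"
	"sync"
	"time"
)

// freshen updates an entry's mtime so the cleaner won't consider it expired.
func freshen(path string, now time.Time) {
	os.Chtimes(path, now, now)
}

// clean would delete entries whose mtime is before cutoff (elided here).
func clean(dir string, cutoff time.Time) {}

// runCommand freshens every entry it has decided to use, and only then
// starts the cleaner, so the cleaner cannot race with a stale-looking
// entry the command is about to read.
func runCommand(dir string, needed []string, work func()) {
	now := time.Now()
	for _, path := range needed {
		freshen(path, now) // must finish before the cleaner starts
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		clean(dir, now.Add(-5*24*time.Hour))
	}()
	work()    // the rest of the command runs concurrently with cleaning
	wg.Wait() // don't exit while deletions are still in flight
}
```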
Since this is all in one process, could we do our synchronization internally, in a thin intermediate layer?
Internally: yes. Thin: that's subjective. 😉
Fair enough. Happy to leave all these choices to you. :) I'd be happy enough with a simple 500ms time limit for automatically initialized cache cleanup.
@josharian, I noticed a similar issue while tracking down #39183. I started with a test that ran lots of goroutines to try to provoke failures, and found that the test with N=10000 ran very quickly (~1s), but the `go` command hung for a long time afterward (~10s), on every run (not just when there were existing cache entries to invalidate).
I wonder if there is a bottleneck in producing the cache key itself, rather than merely cleaning the stale entries.
@josharian, are you on a Mac? I am seeing this too and found that `go clean -cache` was significantly slower than `rm -rf $(go env GOCACHE)`. I haven't chased down what's different though. When I `kill -ABRT`'ed the slow `go clean -cache`, it was in RemoveAll (as expected).
Yep, I'm on a Mac.
This happened to me again, and I found that even `rm -rf $(go env GOCACHE)` was taking a very long time. By "very long time" I mean it took 20 minutes to delete ~1,250,000 files.
On my Mac, fseventsd was writing down all the deleted file names in its own little databases, and then santa (https://github.com/google/santa) was writing down all the file accesses by fseventsd. Without those, I think rm would be faster. I tried telling Spotlight preferences that ~/Library/Caches was private and should not be indexed, but it seems not to affect fseventsd.
> Another option might be to kick off cleaning the cache at the beginning of the cmd/go run, and let it run concurrently with the rest of the command, except when running benchmarks.
One little comment on this approach: as I understand it, the cache trimming is currently done at the end? That means a cache entry older than 5 days might still get used once?
I think this actually helps a lot in the case of Continuous Integration systems: they might cache things for more than 5 days (CircleCI, for example, does 15). If the cache were invalidated up front, purely based on age, the CI would not be able to use it, which might slow things down unnecessarily.
I'm not saying design decisions should be based on this, but it's something I came across while trying to optimise CI build/test times, and it might be worth being aware of.
This sometimes (quite often, in fact) happens to me as well, with `go run`. The program won't terminate after Ctrl+C, no matter how many times I press it. `kill` from another terminal kills the hanging process.
It seems to have something to do with the Go cache. See cmd/go/internal/cache.(*Cache).trimSubdir and cmd/go/internal/cache.(*Cache).Trim:
```
(gdb) bt
#0  runtime/internal/syscall.Syscall6 () at /usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36
#1  0x0000000000404173 in syscall.RawSyscall6 (num=0, a1=18446744073709551516, a2=4211086, a3=18446744073709551516,
    a4=824635707712, a5=0, a6=0, r1=<optimized out>, r2=<optimized out>, errno=<optimized out>)
    at /usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38
#2  0x000000000047e01e in syscall.Syscall6 (trap=262, a1=18446744073709551516, a2=824635707712, a3=824635202936, a4=0, a5=0, a6=0,
    r1=<optimized out>, r2=<optimized out>, err=<optimized out>) at /usr/local/go/src/syscall/syscall_linux.go:91
#3  0x000000000047d3da in syscall.fstatat (fd=-100, path=..., stat=0xc000169d78, flags=0, err=...)
    at /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1439
#4  0x00000000004e0298 in syscall.Stat (path=..., stat=0xffffffffffffff9c, err=...)
    at /usr/local/go/src/syscall/syscall_linux_amd64.go:59
#5  os.statNolog.func1 () at /usr/local/go/src/os/stat_unix.go:32
#6  os.ignoringEINTR (fn=<optimized out>) at /usr/local/go/src/os/file_posix.go:245
#7  os.statNolog (name=..., ~r0=..., ~r1=...) at /usr/local/go/src/os/stat_unix.go:31
#8  0x00000000004dfdd4 in os.Stat (name=..., ~r0=..., ~r1=...) at /usr/local/go/src/os/stat.go:13
#9  0x00000000007c7f85 in cmd/go/internal/cache.(*Cache).trimSubdir (c=<optimized out>, subdir=..., cutoff=...)
    at /usr/local/go/src/cmd/go/internal/cache/cache.go:366
#10 0x00000000007c7c08 in cmd/go/internal/cache.(*Cache).Trim (c=0xc000131200)
    at /usr/local/go/src/cmd/go/internal/cache/cache.go:334
```
@opennota perhaps Russ's generational cache approach would help here, by leading to a more incremental deletion pattern: https://github.com/golang/go/issues/29561#issuecomment-1255769245. If you try it, want to report back whether it also helps here?
@josharian Why would `go run` "helpfully" trim the cache beyond what's needed to start the program at all? The user didn't ask for it.
(I believe that the Go users who have their gocache directory on an SSD do not notice any delay, but my gocache resides on a slow spinning disk.)
I'll try the patch out and let you know if it helps.
Can't reproduce reliably, but...
After running `go test`, I am sometimes seeing the test complete successfully (prints "ok ... pkg ... time"), and then it appears to hang for an indefinite period of time. Sending SIGINT does not terminate the process. Sending SIGTERM does. This happens with std lib tests (strings, crypto/x509). Each time I passed the `-v` flag, although I'm not sure that that is necessary. Using commit 5c13cab36b4667cc1a42667b16b8f049016586e0. This behavior is new, although this is the first time I've run tests in a few days or a week. And it doesn't reproduce 100%, so it's a bit muddled.
I attached lldb to two hung binaries and got similar backtraces from each.
First backtrace
Second backtrace