gopherbot opened 3 months ago
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "compress/flate" && test == ""
This happens to us as well, since upgrading from Go 1.21 to 1.22.1. In a multi-tenant environment it reproduces constantly, but only for a single tenant in a single component, after a variable runtime of 30 minutes to 4 hours. We couldn't find a correlation to specific business-logic actions, but if this is related to the compression package it might just be networking.
$ go version
go version go1.22.1 linux/arm64
An example of such an error is attached - extract-2024-03-27T16_36_44.051Z.txt
Update: our code was using the built-in gzip package in this specific use case, and for this specific tenant, to compress large amounts of data. Once we disabled it the crash disappeared. This seems to match the original failure in this ticket, suggesting issues in the compression code.
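For reference, a minimal sketch of the kind of code path involved, assuming a plain gzip.Writer over a large buffer (illustrative only; compressPayload is a hypothetical stand-in, not the reporter's actual code):

package main

import (
	"bytes"
	"compress/gzip" // wraps compress/flate, where the suspect large object lives
	"log"
)

// compressPayload is a hypothetical stand-in for the disabled code path:
// gzip-compressing a large, tenant-specific payload.
func compressPayload(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := compressPayload(make([]byte, 64<<20)) // crashes were seen with large payloads only
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("compressed to %d bytes", len(out))
}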
Ping, happened to me as well
Ping, this is still happening
I missed the comments here from back in late March/early April, but to be absolutely clear: this failure happened on our old macOS infrastructure, which was known for memory corruption bugs. Those bugs haven't appeared since we switched infrastructure (there were other issues, but they didn't look like this).
For everyone commenting here, I think we need much more information to be able to do anything about this.
Thanks in advance.
(My only suspicion here is that there's a mishandling of "GC progs", which are used for the GC metadata of very large objects. I know for certain that compress/flate has one such object. A reproducer would go a very long way to resolving this.)
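To make the "GC prog" idea concrete, here is an illustrative, hypothetical type (not taken from compress/flate) whose pointer metadata is too large for an inline bitmap, so the runtime would describe it with a GC prog; the exact size threshold is a runtime internal:

package sketch

type entry struct{ next *entry }

// hugeWithPointers has ~64K pointer words, so its GC metadata is far too
// large for an inline pointer bitmap and would be encoded as a GC prog.
type hugeWithPointers struct {
	table [1 << 16]*entry
}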
Hello @mknyszek, thank you for the response. Attached is a crash stack trace from today: extract-2024-05-08T09_29_41.780Z.txt. The code is running in an ARM64 Linux container, built using Go 1.22.2.
We don't have a small reproducer, because it doesn't happen on small payloads. It only happens for a single tenant in our environment, but for them it happens on a daily basis. So we know it depends on the actual data being compressed.
Thanks. I looked at the failing line, and it looks like the type information is nil. arm64 is an interesting platform to fail on because of its weaker memory model -- perhaps GC prog objects are becoming visible to the GC sooner? (Assuming that is the code path being taken here, which is still a bit unclear.)
Go 1.23 actually already has a mitigation for this: if the type is nil, scanning is skipped (this is to support a different use-case). It's possible you just might not see the issue with Go 1.23 already (that is, if you try building Go from tip-of-tree).
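A rough sketch of the shape of that mitigation, assuming it works as described above (this is not the actual runtime source; rtype and scanObject are stand-in names):

package sketch

import "unsafe"

type rtype struct{} // stand-in for the runtime's type descriptor

// scanObject illustrates the described Go 1.23 behavior: a nil type means
// "no pointer metadata", so the scan is skipped rather than dereferencing nil.
func scanObject(obj unsafe.Pointer, size uintptr, typ *rtype) {
	if typ == nil {
		return // nil type: skip scanning entirely
	}
	_, _ = obj, size // ... walk typ's pointer metadata and scan obj ...
}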
compress/flate does. I tried reverting my CL for 67255 and adding this hack to the runtime:
diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index dc6922da54..f897b2928b 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -367,6 +367,30 @@ func reflect_typedarrayclear(typ *_type, ptr unsafe.Pointer, len int) {
 //
 //go:nosplit
 func memclrHasPointers(ptr unsafe.Pointer, n uintptr) {
+	if n == 0 && ptr != nil {
+		p := uintptr(ptr)
+		s := spanOf(p)
+		if s == nil {
+			println(ptr)
+			panic("no span")
+		}
+		if s.state.get() != mSpanInUse {
+			panic("span not in use")
+		}
+		i := s.objIndex(p)
+		if i >= uintptr(s.nelems) {
+			panic("index too big")
+		}
+		if s.isFree(i) {
+			panic("free object!")
+		}
+		if s.elemsize > 512 && s.spanclass.sizeclass() != 0 {
+			q := s.objBase(i)
+			if p == q {
+				panic("header pointer")
+			}
+		}
+	}
 	// Pass nil for the type since we don't have one here anyway.
 	bulkBarrierPreWrite(uintptr(ptr), 0, n, nil)
 	memclrNoHeapPointers(ptr, n)
And all.bash almost passes. Just some tests in test/append.go trigger it. (As does my new test for 67255 if I add that in.) In particular, the compress/flate tests don't trigger it. So it appears unlikely that #67255 is the direct cause, although it is hard to be 100% sure.
... but I'm still betting that a pointer-past-the-end bug somewhere is ultimately responsible.
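For context, a hedged illustration of how such a past-the-end pointer can arise (a made-up example, not code from compress/flate):

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	buf := make([]byte, 512)
	// A pointer computed one byte past the end of buf's backing array lands
	// at the base of whatever the next heap object happens to be. It is
	// legal to form, but if kept live it can make the GC misattribute it
	// to that neighboring object.
	end := unsafe.Add(unsafe.Pointer(&buf[0]), len(buf))
	fmt.Printf("base=%p past-the-end=%p\n", &buf[0], end)
}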
@aciduck Would it be possible for you to try out different things to help narrow down the root cause? For example, with GODEBUG=asyncpreemptoff=1, does that change anything? The fact that it's arm64 is fishy to me, and I'm concerned it's some kind of rare synchronization bug due to the weak memory model. Since you don't have a small reproducer, I assume the only way to try this out is to just run with these modes for a while and see what happens? Meanwhile I'll try to reproduce myself.
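One hypothetical way to try this in a long-running deployment, as a sketch only: GODEBUG is read once at process startup, so a binary can re-exec itself with the variable set rather than changing deployment config.

package main

import (
	"os"
	"os/exec"
	"strings"
)

func main() {
	if !strings.Contains(os.Getenv("GODEBUG"), "asyncpreemptoff") {
		// Re-exec ourselves with async preemption disabled.
		cmd := exec.Command(os.Args[0], os.Args[1:]...)
		cmd.Env = append(os.Environ(), "GODEBUG=asyncpreemptoff=1")
		cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			os.Exit(1)
		}
		return
	}
	// ... normal application entry point runs with async preemption off ...
}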
Hi @mknyszek, we have rolled back the updated config and have been trying to reproduce the issue for two weeks now, without success. As mentioned before, it happened for a single tenant in a multi-tenant environment, so the problem was always related to a specific data set. The data seems to have changed in the past month, and it doesn't cause the issue anymore.
FYI, this panic also reproduces with Go 1.22.4 on the amd64 architecture, while it cannot be reproduced with Go 1.21 - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6425
We are going to build the app with Go tip and see whether it reproduces there.
Update: the panic isn't reproducible in Go 1.23rc1 - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6425#issuecomment-2190417000 . So it looks like the issue has already been fixed in Go 1.23.
Issue created automatically to collect these failures.
Example (log):
— watchflakes