gopherbot commented 3 months ago

#!watchflakes
default <- pkg == "compress/flate" && test == ""

Issue created automatically to collect these failures.

Example (log):

SIGSEGV: segmentation violation
PC=0x10022034c m=6 sigcode=2 addr=0x20

goroutine 0 gp=0x140000848c0 m=6 mp=0x14000054408 [idle]:
runtime.(*mspan).typePointersOfUnchecked(0x140006f0e80?, 0x14000380600?)
    /tmp/buildlet/go/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0x16ff36d90 sp=0x16ff36d70 pc=0x10022034c
runtime.scanobject(0x14003542000, 0x1400002d258)
    /tmp/buildlet/go/src/runtime/mgcmark.go:1439 +0x1c4 fp=0x16ff36e20 sp=0x16ff36d90 pc=0x10022c264
runtime.gcDrain(0x1400002d258, 0x3)
    /tmp/buildlet/go/src/runtime/mgcmark.go:1240 +0x1d4 fp=0x16ff36e90 sp=0x16ff36e20 pc=0x10022ba04
...
r24     0x0
r25     0x1400002d2a0
r26     0xffffffffffffffff
r27     0xffffffffffffffe8
r28     0x140000848c0
r29     0x16ff36d68
lr      0x10022c264
sp      0x16ff36d70
pc      0x10022034c
fault   0x20

— watchflakes

gopherbot commented 3 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "compress/flate" && test == ""

2024-03-18 17:06 darwin-arm64-12 go@dda4b17e compress/flate (log)

SIGSEGV: segmentation violation PC=0x10022034c m=6 sigcode=2 addr=0x20 goroutine 0 gp=0x140000848c0 m=6 mp=0x14000054408 [idle]: runtime.(*mspan).typePointersOfUnchecked(0x140006f0e80?, 0x14000380600?) /tmp/buildlet/go/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0x16ff36d90 sp=0x16ff36d70 pc=0x10022034c runtime.scanobject(0x14003542000, 0x1400002d258) /tmp/buildlet/go/src/runtime/mgcmark.go:1439 +0x1c4 fp=0x16ff36e20 sp=0x16ff36d90 pc=0x10022c264 runtime.gcDrain(0x1400002d258, 0x3) /tmp/buildlet/go/src/runtime/mgcmark.go:1240 +0x1d4 fp=0x16ff36e90 sp=0x16ff36e20 pc=0x10022ba04 ... r24 0x0 r25 0x1400002d2a0 r26 0xffffffffffffffff r27 0xffffffffffffffe8 r28 0x140000848c0 r29 0x16ff36d68 lr 0x10022c264 sp 0x16ff36d70 pc 0x10022034c fault 0x20

— watchflakes

aciduck commented 3 months ago

This happens to us as well, since upgrading from Go 1.21 to 1.22.1. In a multi-tenant environment this constantly reproduces just for a single tenant in a single component, after a variable runtime of 30 minutes - 4 hours. We couldn't find a correlation to specific business logic actions, but if this is related to the compression package it might just be networking.

$ go version
go version go1.22.1 linux/arm64

An example of such an error is attached - extract-2024-03-27T16_36_44.051Z.txt

aciduck commented 3 months ago

Update: our code was using the builtin gzip package specifically in this use case, and for a specific tenant, to compress large amounts of data. Once we disabled it the crash disappeared. This seems to match the original failure in this ticket, sugging issues in the compression code.

dany74q commented 2 months ago

Ping, happened to me as well

AdallomRoy commented 2 months ago

Ping, this is still happening

seankhliao commented 2 months ago

https://go.dev/wiki/NoPlusOne

mknyszek commented 2 months ago

I missed the comments here from back in late March/early April, but to be absolutely clear, this failure happened on our old macOS infrastructure which was known for having memory corruption bugs that haven't been present since we switched that infrastructure (though, there were other issues, but they didn't look like this).

For everyone commenting here, I think we need much more information to be able to do anything about this.

What platform did it happen to you on?
Which Go version was used to build the code?
Can you provide stack traces (even just the runtime parts are helpful)?
Do you have a way to reproduce?

Thanks in advance.

mknyszek commented 2 months ago

(My only suspicion here is that there's a mishandling of "GC progs" which are used for the GC metadata for very large objects. I know for certain that compress/flate has one such object. A reproducer would go a very long way to resolving this.)

aciduck commented 2 months ago

Hello @mknyszek, thank you for the response. Attached is a crash stack trace from today - extract-2024-05-08T09_29_41.780Z.txt The code is running in an ARM64 Linux container, built using Go 1.22.2.

We don't have a small reproducer, because it doesn't happen on small payloads. It only happens for a single tenant in our environment, but for them it happens on a daily basis. So we know it depends on the actual data being compressed.

mknyszek commented 2 months ago

Thanks. I looked at the failing line it looks like the type information is nil. arm64 is an interesting platform to fail on because of the weaker memory model -- perhaps GC prog objects are becoming visible to the GC sooner? (Assuming that is the codepath being taken here, which is still a bit unclear.)

Go 1.23 actually already has a mitigation for this: if the type is nil, scanning is skipped (this is to support a different use-case). It's possible you just might not see the issue with Go 1.23 already (that is, if you try building Go from tip-of-tree).

mknyszek commented 2 months ago

67255 may be related. It can produce the same failure and a fix was recently landed. The situation is an append of a 0-sized slice, which may be something `compress/flate` does.

randall77 commented 2 months ago

I tried reverting my CL for 67255 and adding this hack to the runtime:

diff --git a/src/runtime/mbarrier.go b/src/runtime/mbarrier.go
index dc6922da54..f897b2928b 100644
--- a/src/runtime/mbarrier.go
+++ b/src/runtime/mbarrier.go
@@ -367,6 +367,30 @@ func reflect_typedarrayclear(typ *_type, ptr unsafe.Pointer, len int) {
 //
 //go:nosplit
 func memclrHasPointers(ptr unsafe.Pointer, n uintptr) {
+       if n == 0 && ptr != nil {
+               p := uintptr(ptr)
+               s := spanOf(p)
+               if s == nil {
+                       println(ptr)
+                       panic("no span")
+               }
+               if s.state.get() != mSpanInUse {
+                       panic("span not in use")
+               }
+               i := s.objIndex(p)
+               if i >= uintptr(s.nelems) {
+                       panic("index too big")
+               }
+               if s.isFree(i) {
+                       panic("free object!")
+               }
+               if s.elemsize > 512 && s.spanclass.sizeclass() != 0 {
+                       q := s.objBase(i)
+                       if p == q {
+                               panic("header pointer")
+                       }
+               }
+       }
        // Pass nil for the type since we don't have one here anyway.
        bulkBarrierPreWrite(uintptr(ptr), 0, n, nil)
        memclrNoHeapPointers(ptr, n)

And all.bash almost passes. Just some tests in test/append.go trigger it. (As does my new test for 67255 if I add that in.) In particular, compress/flate tests don't trigger it. So it appears unlikely that #67255 is the direct cause, although it is hard to be 100% sure.

randall77 commented 2 months ago

... but I'm still betting that a pointer-past-the-end bug somewhere is ultimately responsible.

mknyszek commented 1 month ago

@aciduck Would it be possible for you to try out different things to help narrow down the root cause?

If you set GODEBUG=asyncpreemptoff=1, does that change anything? The fact that it's arm64 is fishy to me, and I'm concerned it's some kind of rare synchronization bug due to the weak memory model.
If you're open to trying out a patch, it would be good to confirm or deny my suspicion about it GC progs at fault. I can prepare that shortly.
What happens if you try Go at tip-of-tree? As I mentioned before, this may have been unintentionally mitigated already, and if so, it might be that it's just fixed (which may also be a clue as to what's going on). Simultaneously, if instead of this failure, you start to see new ones (looking more like random memory corruption), that's also a hint that something is very wrong.

Since you don't have a small reproducer, I assume the only way to try this out is to just run with these modes for a while and see what happens? Meanwhile I'll try to reproduce myself.

aciduck commented 1 month ago

Hi @mknyszek, we have rolled back the updated config and have been trying to reproduce the issue for two weeks now, without success. As mentioned before, it happened for a single tenant in a multi-tenant environment, so the problem was always related to a specific data set. The data seems to have changed in the past month, and it doesn't cause the issue anymore.

valyala commented 3 weeks ago

FYI, this panic is also reproduced in Go1.22.4 on amd64 architecture, while it cannot be reproduced in Go1.21 - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6425

We are going to build the app with Go tip and see whether it is reproduced there.

valyala commented 2 weeks ago

Update: the panic isn't reproducible in Go1.23rc1 - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6425#issuecomment-2190417000 . So, it looks like the issue has been already fixed in Go1.23.

golang / go

compress/flate: unrecognized failures #66383

67255 may be related. It can produce the same failure and a fix was recently landed. The situation is an append of a 0-sized slice, which may be something `compress/flate` does.

golang / go

compress/flate: unrecognized failures #66383

67255 may be related. It can produce the same failure and a fix was recently landed. The situation is an append of a 0-sized slice, which may be something compress/flate does.

67255 may be related. It can produce the same failure and a fix was recently landed. The situation is an append of a 0-sized slice, which may be something `compress/flate` does.