golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.25k stars 17.57k forks source link

runtime: fatal error: sweep increased allocation count go 1.10.1 windows/amd64 #24881

Closed neclepsio closed 6 years ago

neclepsio commented 6 years ago

What version of Go are you using (go version)?

go version go1.10.1 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Ignazio\AppData\Local\go-build
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=D:\Ignazio\Home\Go-Workspace
set GORACE=
set GOROOT=D:\Ignazio\IgnPack\Go\Go110
set GOTMPDIR=
set GOTOOLDIR=D:\Ignazio\IgnPack\Go\Go110\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Ignazio\AppData\Local\Temp\go-build060324558=/tmp/go-build -gno-record-gcc-switches

What did you do?

I'm developing an opengl program using go-qml; when loading complex models the runtime it often crashes using version 1.10.1,1.10, 1.9.5 (in which a similar issue, #22781, has been fixed), 1.8.7.

It seems to work in 1.7.6.

What did you expect to see?

No error.

What did you see instead?

A crash:

runtime: nelems=51 nalloc=3 previous allocCount=2 nfreed=65535
fatal error: sweep increased allocation count

runtime stack:
runtime.throw(0x729e6c, 0x20)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/panic.go:616 +0x88
runtime.(*mspan).sweep(0x3c63a0, 0x3c6300, 0x1040000)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:298 +0xa57
runtime.sweepone(0x0)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:113 +0x127
runtime.(*mheap).reclaim(0x8fd7c0, 0x1)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mheap.go:640 +0x15f
runtime.(*mheap).alloc_m(0x8fd7c0, 0x1, 0x8fb000a, 0x429f82)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mheap.go:674 +0x8e
runtime.(*mheap).alloc.func1()
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mheap.go:753 +0x54
runtime.(*mheap).alloc(0x8fd7c0, 0x1, 0x901000a, 0x8fbfd98)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mheap.go:752 +0x91
runtime.(*mcentral).grow(0x8fed90, 0x0)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mcentral.go:232 +0xd8
runtime.(*mcentral).cacheSpan(0x8fed90, 0x54a7ce0)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mcentral.go:106 +0x2f6
runtime.(*mcache).refill(0x3c0000, 0x100a)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/mcache.go:123 +0xaa
runtime.(*mcache).nextFree.func1()
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/malloc.go:556 +0x39
runtime.systemstack(0x0)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:409 +0x7e
runtime.mstart()
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/proc.go:1175

goroutine 540 [running]:
runtime.systemstack_switch()
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:363 fp=0xc0426093a8 sp=0xc0426093a0 pc=0x457940
runtime.(*mcache).nextFree(0x3c0000, 0xa, 0x9a, 0x9a, 0xffffffff)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/malloc.go:555 +0xb0 fp=0xc042609408 sp=0xc0426093a8 pc=0x414090
runtime.mallocgc(0x40, 0x6c94a0, 0x1, 0xa)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/malloc.go:710 +0x6b5 fp=0xc0426094b0 sp=0xc042609408 pc=0x414925
runtime.makeslice(0x6c94a0, 0x4, 0x4, 0xa, 0x732a40, 0x0)
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/slice.go:61 +0x7b fp=0xc0426094e0 sp=0xc0426094b0 pc=0x4428eb
strings.Fields(0xc0442779e5, 0xa, 0x0, 0x0, 0x0)
    D:/Ignazio/IgnPack/Go/Go110/src/strings/strings.go:335 +0x15b fp=0xc042609610 sp=0xc0426094e0 pc=0x516aeb
idn/glscene/shapes.ReadVRML(0xc042203450, 0x47, 0x0, 0x0, 0x0)
    D:/Ignazio/Home/Go-Workspace/src/idn/glscene/shapes/vrml.go:58 +0x528 fp=0xc042609d80 sp=0xc042609610 pc=0x6164b8
main.addModelsFromDir.func1(0xc0421f18c0, 0x3e, 0x743ee0, 0xc0421fec60, 0xc0424747e0, 0x4, 0x4, 0xc042488720, 0xc0424747f0)
    D:/Ignazio/Home/Go-Workspace/src/idp/sketches.v2/application.go:364 +0x17c fp=0xc042609f98 sp=0xc042609d80 pc=0x69a32c
runtime.goexit()
    D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc042609fa0 sp=0xc042609f98 pc=0x45a571
created by main.addModelsFromDir
    D:/Ignazio/Home/Go-Workspace/src/idp/sketches.v2/application.go:363 +0x338

[some outer goroutines not running]

fatal error: exitsyscall: syscall frame is no longer valid
Failed to continue.

Another:

runtime: nelems=23 nalloc=2 previous allocCount=1 nfreed=65535
fatal error: sweep increased allocation count

runtime stack:
runtime.throw(0x67cdde, 0x20)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/panic.go:616 +0x88
runtime.(*mspan).sweep(0xed22fd0, 0x0, 0xc042028900)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:298 +0x8b1
runtime.sweepone(0x434e9d)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:113 +0x126
runtime.gosweepone.func1()
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:137 +0x32
runtime.systemstack(0x0)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:409 +0x7e
runtime.mstart()
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/proc.go:1175

goroutine 3 [running]:
runtime.systemstack_switch()
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:363 fp=0xc04202bf70 sp=0xc04202bf68 pc=0x451bc0
runtime.gosweepone(0x0)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:136 +0x51 fp=0xc04202bfa0 sp=0xc04202bf70 pc=0x4203f1
runtime.bgsweep(0xc042040000)
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgcsweep.go:55 +0xc3 fp=0xc04202bfd8 sp=0xc04202bfa0 pc=0x41ff33
runtime.goexit()
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc04202bfe0 sp=0xc04202bfd8 pc=0x4547f1
created by runtime.gcenable
        D:/Ignazio/IgnPack/Go/Go110/src/runtime/mgc.go:216 +0x5f

Please help me sharing all the data you need; I have not been able write a reproducer and I cannot share source or models.

ALTree commented 6 years ago

This kind of errors are almost always caused by bad cgo code or by erroneous (racy) Go code. Usual questions:

  1. Does your program use cgo?
  2. Does compiling and then running it with the race detector enabled (-race) print any race warning?
neclepsio commented 6 years ago

@ALTree Thank you.

  1. Yes, it uses the now unsupported go-qml, with SjB modifications to make it work after go 1.6; the fact it does not crash under go 1.7.6 is maybe relevant to exclude problems in this field?
  2. No, there is no warning from race detector (sorry, reading https://github.com/golang/go/issues/22781#issuecomment-345405365, I though it was not relevant).
ALTree commented 6 years ago

the fact it does not crash under go 1.7.6 is maybe relevant to exclude problems in this field?

Unfortunately, I don't think it is. This kind of crashes are often not completely deterministic, so it's possible that a program that incorrectly uses cgo gets away with it in go1.7 by sheer luck, and then starts crashing in later releases due to changes in the runtime's internal implementation details, or other similar factors.

Considering you can't share the code and we don't have a reproducer, I don't think there's much that can be done here to investigate the issue or help you.

My guess is that the crash is induced by memory corruption caused by the cgo part of your program. If you could provide a race-free, cgo-free reproducer (like the one we had in #22781) then someone will certainly look into it, otherwise this issue is not really actionable.

RLH commented 6 years ago

A bit of a background might help. This check was added in order to detect accounting errors, in this case the GC has found more reachable objects than expected. This is typically caused when a pointer is hidden from the GC either due to a race condition, unsafe manipulations, or CGO hackery. The pointer is later resurrected and noticed by the runtime. The fact that no error was reported in earlier versions of Go is not an indication that this wasn't always a bug, it's just that the runtime just started flagging it.

Of course the bug may not be in the application at all but in the runtime or in the compiler. Unfortunately, since the problem happened far away from where the runtime found the accounting error a reproducer is probably needed to determine whether the bug is in the runtime or compiler.

Setting GOGC to a low number will induce many GCs and hopefully increase the chances that the bug will trigger. I have used this technique to decrease non-determinism and help create reproducers.

neclepsio commented 6 years ago

Thank you. In an attempt to make a cgo-free reproducer (which I'm far from), I discovered something that is maybe useful. I had this code:

ok := make(chan int)
go func() {
   // clear output folder
    ok <- 1
}()

go func() {
   // read 3d models and add to result channel
    ok <- 1
}()

go func() {
    // read text file
    <- ok
    <- ok
    // read from result channel and queue adding models to the view in OpenGL thread
}()

Just changing it to use a sync.WaitGroup instead of ok made the issue happen much less. From 40% of times (with a particular set of data), it now happens just 10% of times. There were no other changes in the code. I can also confirm that in all my tests I got no warning from race detector.

After that, I added a "rate limiter" to the goroutine adding models to the view (which runs in the OpenGL thread): before, it added the models as soon as they where sent to the (unbuffered) channel. Now the channel is buffered and the goroutine only adds the models to the view once per second. After that, I got no crashes.

neclepsio commented 6 years ago

@RLH thank you for your explaination, which I read after my message (which waited to be confirmed for several hours while I did my tests). You are right, after I changed GOGC the issue resurfaced. Since we suspect the issue is in cgo library, I don't know if the issue should be reopen or kept closed until I can write a reproducer.

ALTree commented 6 years ago

Since we suspect the issue is in cgo library, I don't know if the issue should be reopen or kept closed until I can write a reproducer.

Let's keep this closed, and open a new one when (if) you find a reasonably short reproducer.