Closed: tbonfort closed this issue 3 years ago
@tbonfort, if you set -timeout=15s or similar you should get a goroutine dump from the Go runtime. Could you paste the full goroutine dump into a <details> block in a comment on this issue? The GDB backtrace appears to include only the subset of goroutines currently running on OS threads.
It isn't clear to me whether CL 303330 introduced a deadlock itself, exposed some other runtime bug (perhaps in defer and recover handling; compare #43941?), or exposed a latent bug in the test or one of the helper libraries it uses. A full goroutine dump would help to determine which of those is the case.
@bcmills, sorry, I should have posted the Go backtrace in the main issue. I didn't include it because it didn't contain any meaningful information for me, aside from messages like "goroutine running on other thread; stack unavailable". I'll post the exact output tomorrow (UTC).
Here is the dump:
panic: test timed out after 10s
goroutine 50 [running]:
testing.(*M).startAlarm.func1()
/home/tbonfort/dev/thirdparty/go/src/testing/testing.go:1788 +0x8e
created by time.goFunc
/home/tbonfort/dev/thirdparty/go/src/time/sleep.go:176 +0x32
goroutine 1 [running]:
goroutine running on other thread; stack unavailable
goroutine 6 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000fa280)
/home/tbonfort/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276 +0xad
created by go.opencensus.io/stats/view.init.0
/home/tbonfort/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:34 +0x8d
goroutine 35 [running]:
goroutine running on other thread; stack unavailable
created by testing.(*T).Run
/home/tbonfort/dev/thirdparty/go/src/testing/testing.go:1306 +0x35f
exit status 2
FAIL github.com/airbusgeo/godal 10.032s
Yikes, that's not as helpful as I was hoping. 😞 Thanks anyway — I'll dig into it more from here!
Some additional debugging info that may be useful:
Using make(chan bool, 2) for t.signal does not change the hanging behavior.
I added some fmt.Fprintf(os.Stderr, "START|DONE <- t.signal %p\n", t.signal) calls around the lines sending/receiving on t.signal, in t.Run and in tRunner's deferred func.
On "working tests", I get
=== RUN TestBandMask
START <- t.signal 0xc0000c21c0
--- PASS: TestBandMask (0.00s)
START: t.signal <- signal 0xc0000c21c0
DONE: t.signal <- signal 0xc0000c21c0
DONE <- t.signal 0xc0000c21c0
In the hanging case:
=== RUN TestOpenShared
START <- t.signal 0xc0003a00e0
--- PASS: TestOpenShared (0.00s)
START: t.signal <- signal 0xc0003a00e0
DONE: t.signal <- signal 0xc0003a00e0
DONE <- t.signal 0xc0003a00e0
=== RUN TestRegister
START <- t.signal 0xc0003a01c0
--- PASS: TestRegister (0.00s)
START: t.signal <- signal 0xc0003a01c0
/* hang */
Which seems strange, as the two goroutines appear to be blocked on a read and a write of the same channel.
Running the test suite under valgrind causes a hang with the same t.signal locking pattern, but in a subsequent test (i.e. a few tests after TestRegister). Outside of valgrind, the test suite consistently hangs right after TestRegister.
Checking in on this issue, as it's labeled as a release blocker.
I'm actively looking into it today.
Using libgdal 3.2.2+dfsg-3 (installed via apt) I see a test failure in TestVSIGCS under go1.16.8:
~/tmp/godal$ go1.16.8 test . -timeout=20s
ERROR 10: Pointer 'hSRS1' is NULL in 'OSRIsSame'.
ERROR 10: Pointer 'hSRS2' is NULL in 'OSRIsSame'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRIsGeographic'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetSemiMajor'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetSemiMajor'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetAuthorityCode'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRGetAuthorityName'.
ERROR 10: Pointer 'hSRS' is NULL in 'OSRAutoIdentifyEPSG'.
ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 1: PROJ: proj_get_ellipsoid: CRS has no geodetic CRS
ERROR 1: PROJ: proj_get_ellipsoid: Object is not a CRS or GeodeticReferenceFrame
ERROR 10: Pointer 'hGeom' is NULL in 'OGR_G_IsEmpty'.
--- FAIL: TestVSIGCS (0.22s)
godal_test.go:3044: new reader for gs://godal-ci-data/test.tif: Get "https://storage.googleapis.com/godal-ci-data/test.tif": oauth2: cannot fetch token: 400 Bad Request
Response: {
"error": "invalid_grant",
"error_description": "Bad Request"
}
OGRCT: Wrap target at 50.
ERROR 10: Pointer 'hBand' is NULL in 'GDALGetMaskBand'.
ERROR 10: Pointer 'hBand' is NULL in 'GDALGetMaskBand'.
FAIL
FAIL github.com/airbusgeo/godal 2.441s
FAIL
I can reproduce the reported deadlock using both go1.17 and gotip.
Reproducing this requires only two test functions, but it does require two:
~/tmp/godal$ gotip test -run='TestOpenShared' -timeout=12s
PASS
ok github.com/airbusgeo/godal 0.097s
~/tmp/godal$ gotip test -run='TestRegister' -timeout=12s
PASS
ok github.com/airbusgeo/godal 0.102s
~/tmp/godal$ gotip test -run='Test(OpenShared|Register)' -timeout=12s
panic: test timed out after 12s
goroutine 18 [running]:
testing.(*M).startAlarm.func1()
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1970 +0x8e
created by time.goFunc
/usr/local/google/home/bcmills/sdk/gotip/src/time/sleep.go:176 +0x32
goroutine 1 [running]:
goroutine running on other thread; stack unavailable
goroutine 6 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000fa280)
/usr/local/google/home/bcmills/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276 +0xad
created by go.opencensus.io/stats/view.init.0
/usr/local/google/home/bcmills/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:34 +0x8d
goroutine 9 [running]:
goroutine running on other thread; stack unavailable
created by testing.(*T).Run
/usr/local/google/home/bcmills/sdk/gotip/src/testing/testing.go:1439 +0x35f
exit status 2
FAIL github.com/airbusgeo/godal 12.095s
Change https://golang.org/cl/351532 mentions this issue: testing: add debugging assertions
This looks like either memory corruption from something in the C library or a bug in the Go compiler. Digging a bit further to figure out which.
@bcmills I firmly believe that the failing test is unrelated to the issue at hand and is due to incorrect credential handling in another part of the code (i.e. it fails identically with go < 1.17). The ERROR XXX messages are expected; they come from the test suite deliberately exercising error-raising cases.
This looks like either memory corruption from something in the C library or a bug in the Go compiler.
With go < 1.17 (I don't have enough feedback for 1.17) things have been running smoothly with the same codebase. The valgrind output before the hang is "clean", i.e. it does not show errors in anything related to the C code.
Looks like memory corruption from the C side. It looks like something in the unsafe code is freeing a Go-owned zero page to the C heap, which is overwriting the zero page with nonzero bytes.
Then the make(chan bool, 1) in (*testing.T).Run ends up allocating a channel with capacity 1 and initial length 0x6363636363636363, instead of the initial length 0 that it is supposed to have, and sending to a 1-buffered channel that already contains 0x6363636363636363 values blocks. 😅
I don't really have the time to debug how the zeroed Go pages are getting corrupted (I can't set a watchpoint easily because the value hasn't even been allocated yet), but at this point I'm confident enough to rule out a bug in the Go compiler or the testing package.
You might be able to get more mileage out of valgrind using the Go -msan flag. Good luck!
(Also, note that you might be able to use the new unsafe.Slice and unsafe.Add added in Go 1.17 to simplify this code...)
This issue can be closed as invalid. The OpenShared test case was asking the cgo lib to populate 300 bytes of data inside a slice we had sized to only 100 bytes. I'm actually confused as to why this didn't crash in previous Go versions. Thanks @bcmills for the investigation, and sorry for erroneously dragging you into an issue that's entirely unrelated to the bisected commit.
I have a couple of notes from debugging.
This was probably masked on earlier Go versions because the same memory corruption manifested as an undetected goroutine leak within the testing package.
You could add an extra margin of safety to the cBuffer helper function by passing in the expected dimensions and then slicing the buffer to those dimensions; that will cause a run-time panic (instead of memory corruption) if the passed-in buffer is too small.
https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal.go#L1477 and similar lines might become something like:
cBuf = unsafe.Pointer(&buf[:bufWidth*bufHeight*nBands][0])
Thanks for the suggestion; I was also wondering whether I should add some extra checks for that buffer, as it is error-prone (a couple of other tests had similar buffer-sizing errors). The actual minimal buffer size is trickier to calculate than just bufWidth*bufHeight*nBands, as it is also possible to configure the pixel, line and band strides (i.e. the spacing from one pixel/line/band to the next).
While I have your attention, do you have an idea why the gs:// test failed with a 400 error on your side? It has been raised by the google cloud api directly, via https://github.com/airbusgeo/osio/blob/main/gcs.go#L63 and https://github.com/airbusgeo/osio/blob/main/gcs.go#L81 (I would have expected a 403 or 404 error if you had credentials available)
No idea what's up with the 400 error, unfortunately. 🤷‍♂️
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
My cgo test suite deadlocks; commenting out a seemingly random cgo call two tests before the deadlock fixes the issue.
Failing test run:
Commenting out this line in the TestOpenShared test fixes the deadlock: https://github.com/airbusgeo/godal/blob/00980a3df723677b2d949a10eedd78ab6c5d6aee/godal_test.go#L1242
Here is a gdb stacktrace when deadlocked:
I have bisected the issue down to 1c590661e7d3b477662f76ead56f39567ea8345a (from @bcmills). Reverting this specific commit in master fixes the deadlock.