Closed Arqu closed 2 years ago
@dustmop So I've been playing with this a lot more and the more I get into it the weirder it is. The original panic was
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4029d1]
goroutine 1471423 [running]:
github.com/qri-io/qri/lib/websocket.(*connections).unsubscribeConn(0xc002308ea0, 0xc010fe8420, 0x2e, 0xc0197c2000, 0x24)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:209 +0x185
github.com/qri-io/qri/lib/websocket.(*connections).removeConn(0xc002308ea0, 0xc0197c2000, 0x24)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:240 +0x16e
github.com/qri-io/qri/lib/websocket.(*connections).read(0xc002308ea0, 0xc0197c2000, 0x24, 0x0, 0x0)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:262 +0x167
created by github.com/qri-io/qri/lib/websocket.(*connections).ConnectionHandler
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:103 +0x2b9
Which refers to https://github.com/qri-io/qri/blob/8e4267357bc3ad98a6cb12aa8832d3d02b5ef8d9/lib/websocket/websocket.go#L209 Which makes no sense to me as none of those are pointers. Forcing races, manually injecting nils etc does not really force the issue to repeat. I've hacked together a test to get a repro, but couldn't.
Any ideas on your end? Otherwise maybe just be defensive about it and add the guards since real world != theoretical normal usage
After looking at the code for a while, I don't have an explanation either for what the original stack trace is about. Quite mysterious. However, I think it would be useful to just add a test that imitates the supposed problem that this fix is covering. It should look like this:
Furthermore, I don't think we need a test that necessarily duplicates exactly what is happening to cause this crash in the wild. That is probably too difficult right now. As long as some tests exists that handles the code block being added, we will at least be able to refactor or modify the code in the future without accidentally breaking this behavior.
We got a few segfaults on
unsubscribeConn
due to the id returning a nil set presumably due to emptying all connections without deleting the actual connection. Not worth keeping a full ref count on each and adding lots of code around it, it's enough to just garbage collect later here.