qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0
1.1k stars 66 forks source link

fix(ws): fix nil pointer #1968

Closed Arqu closed 2 years ago

Arqu commented 2 years ago

We got a few segfaults on unsubscribeConn due to the id returning a nil set presumably due to emptying all connections without deleting the actual connection. Not worth keeping a full ref count on each and adding lots of code around it, it's enough to just garbage collect later here.

Arqu commented 2 years ago

@dustmop So I've been playing with this a lot more and the more I get into it the weirder it is. The original panic was

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4029d1]

goroutine 1471423 [running]:
github.com/qri-io/qri/lib/websocket.(*connections).unsubscribeConn(0xc002308ea0, 0xc010fe8420, 0x2e, 0xc0197c2000, 0x24)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:209 +0x185
github.com/qri-io/qri/lib/websocket.(*connections).removeConn(0xc002308ea0, 0xc0197c2000, 0x24)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:240 +0x16e
github.com/qri-io/qri/lib/websocket.(*connections).read(0xc002308ea0, 0xc0197c2000, 0x24, 0x0, 0x0)
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:262 +0x167
created by github.com/qri-io/qri/lib/websocket.(*connections).ConnectionHandler
\t/go/pkg/mod/github.com/qri-io/qri@v0.10.1-0.20211103120703-8e4267357bc3/lib/websocket/websocket.go:103 +0x2b9

Which refers to https://github.com/qri-io/qri/blob/8e4267357bc3ad98a6cb12aa8832d3d02b5ef8d9/lib/websocket/websocket.go#L209 Which makes no sense to me as none of those are pointers. Forcing races, manually injecting nils etc does not really force the issue to repeat. I've hacked together a test to get a repro, but couldn't.

Any ideas on your end? Otherwise maybe just be defensive about it and add the guards since real world != theoretical normal usage

dustmop commented 2 years ago

After looking at the code for a while, I don't have an explanation either for what the original stack trace is about. Quite mysterious. However, I think it would be useful to just add a test that imitates the supposed problem that this fix is covering. It should look like this:

  1. Construct a websocketHandler object
  2. Get keydata from testkeys
  3. add a nil pointer for the profileID of that test data
  4. call getConnID
  5. verify that the call returns an error
  6. verify that the map no longer contains that profileID
dustmop commented 2 years ago

Furthermore, I don't think we need a test that necessarily duplicates exactly what is happening to cause this crash in the wild. That is probably too difficult right now. As long as some tests exists that handles the code block being added, we will at least be able to refactor or modify the code in the future without accidentally breaking this behavior.