libp2p / go-msgio

simple package to r/w length-delimited slices.

panic: runtime error: slice bounds out of range #16

Closed: iulianpascalau closed this issue 4 years ago

iulianpascalau commented 5 years ago

Hello, we got this weird issue once, after hundreds of hours of intensive libp2p usage, on a single machine out of a couple of hundred:

goroutine 24130 [running]:
github.com/libp2p/go-msgio.(*reader).ReadMsg(0xc0056c39f0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-msgio@v0.0.2/msgio.go:216 +0x25f
github.com/libp2p/go-libp2p-secio.(*etmReader).fill(0xc00645cd20, 0x634, 0x0)
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-libp2p-secio@v0.1.0/rw.go:132 +0x3f
github.com/libp2p/go-libp2p-secio.(*etmReader).Read(0xc00645cd20, 0xc005fd9c20, 0xc, 0xc, 0x0, 0x0, 0x0)
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-libp2p-secio@v0.1.0/rw.go:171 +0x264
io.ReadAtLeast(0x2c919c8, 0xc00034d440, 0xc005fd9c20, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
D:/Go/src/io/io.go:310 +0x8f
io.ReadFull(...)
D:/Go/src/io/io.go:329
github.com/libp2p/go-yamux.(*Session).recvLoop(0xc004a51860, 0x0, 0x0)
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-yamux@v1.2.2/session.go:499 +0xdb
github.com/libp2p/go-yamux.(*Session).recv(0xc004a51860)
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-yamux@v1.2.2/session.go:478 +0x32
created by github.com/libp2p/go-yamux.newSession
C:/Users/Sorin/go/pkg/mod/github.com/libp2p/go-yamux@v1.2.2/session.go:128 +0x38f

The go-msgio version we use is v0.0.2

Looking at the commits after this version, I found that some improvements were made (including in the nextMsgLen func), but this panic should not have occurred even in v0.0.2. I'm asking whether there is a slight possibility of encountering this problem in a setting where the reader R might have been used by multiple goroutines - for example, reading from and writing to the same stream from multiple goroutines? Thanks

Stebalien commented 5 years ago

I'm asking whether there is a slight possibility of encountering this problem in a setting where the reader R might have been used by multiple goroutines - for example, reading from and writing to the same stream from multiple goroutines? Thanks

I'm not sure if that's causing this issue, but that's definitely going to cause problems if you're reading from multiple goroutines at the same time. msgio needs to atomically read the length and then the message.
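
To illustrate the framing (a rough sketch, not the actual msgio internals): each message is a length prefix followed by that many bytes of body, and both have to be consumed by the same caller, so two goroutines reading the same stream can interleave one message's length with another message's body.

package example

import (
    "encoding/binary"
    "io"
)

// readLengthDelimited is illustrative only, not go-msgio's code. It reads a
// 4-byte big-endian length prefix and then the message body. If two
// goroutines call it on the same reader without locking, one can consume
// the other's length prefix or body and the framing is corrupted.
func readLengthDelimited(r io.Reader) ([]byte, error) {
    var lenBuf [4]byte
    if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
        return nil, err
    }
    msgLen := binary.BigEndian.Uint32(lenBuf[:])

    msg := make([]byte, msgLen)
    if _, err := io.ReadFull(r, msg); err != nil {
        return nil, err
    }
    return msg, nil
}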

However, this particular bug indicates that R is claiming to have read more data than it did. We had a bug like this in secio, but that was fixed before 0.0.3.

What's the underlying reader?

iulianpascalau commented 5 years ago

What's the underlying reader?

I do not know how to answer that. We'll try to update the libp2p libraries. Meanwhile, I can say that reading from a peer is done on only one goroutine (same as in go-libp2p-pubsub), and we made writing concurrency-safe by using a mutex for each peer.
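
For reference, a minimal sketch of the write side described above (the names are hypothetical, not our actual code): reads stay on one goroutine per peer, so only writes need the lock.

package example

import (
    "io"
    "sync"
)

// peerWriter is a hypothetical illustration of the setup described above:
// writes to a single peer's stream are serialized with a per-peer mutex,
// while reading from that peer happens on one dedicated goroutine.
type peerWriter struct {
    mu sync.Mutex
    w  io.Writer // the peer's stream
}

// Write is safe to call from multiple goroutines; only one write hits the
// underlying stream at a time.
func (p *peerWriter) Write(b []byte) (int, error) {
    p.mu.Lock()
    defer p.mu.Unlock()
    return p.w.Write(b)
}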

Stebalien commented 5 years ago

What transport are you using?

iulianpascalau commented 5 years ago

TCP

Stebalien commented 5 years ago

Actually, looking at this a bit, we lock internally anyway, so it shouldn't be an issue. I'm not sure what's going on here. It may be that the bug was exposed by a fix in 0.0.2 (we now truncate msg on short reads). But the underlying reader must be reporting the wrong length.
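
To make that concrete, here is a small self-contained sketch (not libp2p code) of how a reader that over-reports its count slips past io.ReadFull and then trips that truncation:

package main

import (
    "fmt"
    "io"
    "strings"
)

// overReportingReader simulates a broken io.Reader that claims to have read
// one more byte than it actually did. Purely illustrative.
type overReportingReader struct{ r io.Reader }

func (o overReportingReader) Read(p []byte) (int, error) {
    n, err := o.r.Read(p)
    return n + 1, err // violates the io.Reader contract: n may exceed len(p)
}

func main() {
    buf := make([]byte, 4)
    // io.ReadFull trusts the reported count, so n can come back larger
    // than len(buf).
    n, err := io.ReadFull(overReportingReader{strings.NewReader("abcd")}, buf)
    fmt.Println(n, len(buf), err) // prints: 5 4 <nil>

    // A caller that truncates on short reads, roughly what msgio does with
    // msg = msg[:n], would now panic: slice bounds out of range.
}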

Stebalien commented 5 years ago

Do you have relay enabled, by any chance?

iulianpascalau commented 5 years ago

Do you have relay enabled, by any chance?

Yes. It is enabled by default:

opts := []libp2p.Option{
    libp2p.ListenAddrStrings(address),
    libp2p.Identity(p2pPrivKey),
    libp2p.DefaultMuxers,
    libp2p.DefaultSecurity,
    libp2p.ConnectionManager(conMgr),
    libp2p.DefaultTransports,
}
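
If it helps narrow things down, relay could presumably be switched off explicitly to test the circuit-relay hypothesis. A sketch only, assuming go-libp2p's libp2p.DisableRelay() option applies to the version in use; we have not tested this:

// Same options as above, with the circuit-relay transport explicitly
// disabled to see whether the panic still occurs without it.
opts := []libp2p.Option{
    libp2p.ListenAddrStrings(address),
    libp2p.Identity(p2pPrivKey),
    libp2p.DefaultMuxers,
    libp2p.DefaultSecurity,
    libp2p.ConnectionManager(conMgr),
    libp2p.DefaultTransports,
    libp2p.DisableRelay(),
}
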
Stebalien commented 5 years ago

Hm. So it could be a bug in any of our stream muxers as well. However, I can't find any obvious bugs. It could also be memory corruption.

This is going to take a while to debug.

iulianpascalau commented 5 years ago

As stated, it is odd that only one of our early users has encountered this. The machine is above average: an i5 with 16 GB RAM and an SSD, running Windows 10 Pro. The other machines, around 138 AWS t3.medium instances running Ubuntu, have never encountered this in the last two months. We will investigate further how to reproduce this error. Thanks.

Stebalien commented 5 years ago

Thanks! These kinds of bugs are honestly the most annoying for everyone involved. They hit when you don't expect them and leave no trace.

iulianpascalau commented 5 years ago

Tell me about it! :)

LucianMincu commented 5 years ago

@Stebalien is there any update on this?

Stebalien commented 5 years ago

No. I've never seen this issue in the wild so it's a bit tricky to debug.

Unless this is a data race (quite possible), I'm guessing:

  1. We're using the circuit-relay transport here (enabled by default).
  2. There's a bug in one of our stream multiplexers (likely mplex).

I'm guessing the setup is:

yamux (a) -> secio (b) -> msgio (c) -> circuit-relay -> {mplex/yamux} (d) -> secio (e) -> msgio -> tcp

Why? Well, the bug is showing up in msgio (c), which is below yamux/secio (a/b). However, the bug likely isn't in msgio (c), as msgio just trusts the return value from io.ReadFull. The bug could be in the TCP transport (assuming we're not using circuit-relay), but then it would likely be in the Go standard library.

It's not in circuit-relay, as circuit-relay just forwards reads. It could be in the second secio (e), but I'd expect a panic in (d) if that's where the bug is.

Therefore, I'm pretty sure the bug is in (d) (mplex/yamux). The next step is to dig through those to figure out how they might claim to have read more than the read buffer.
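
One way to dig would be to sandwich a contract-checking wrapper between the layers so the panic fires at the layer that actually misreports, instead of two layers up in msgio. This is a debugging sketch, not part of any libp2p package:

package example

import (
    "fmt"
    "io"
)

// checkedReader is a debugging aid, not libp2p code. Wrapping each layer of
// the stack above in one of these would pinpoint which reader first returns
// a count larger than the buffer it was given.
type checkedReader struct {
    name string
    r    io.Reader
}

func (c checkedReader) Read(p []byte) (int, error) {
    n, err := c.r.Read(p)
    if n < 0 || n > len(p) {
        panic(fmt.Sprintf("%s violated io.Reader: returned n=%d for a %d-byte buffer", c.name, n, len(p)))
    }
    return n, err
}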

vyzo commented 5 years ago

This sounds dangerously like the secio bug we fixed, though. Have we eliminated the possibility of an unpatched secio being involved?

Stebalien commented 5 years ago

It does, but secio 0.1.0 is being used (which has this fixed): github.com/libp2p/go-libp2p-secio@v0.1.0

Stebalien commented 5 years ago

But, looking at the stream implementations for yamux and mplex, I can't see how it could be in either.

Stebalien commented 4 years ago

Closing as this hasn't been reproduced in a while. Please leave a comment if you're using the latest version of go-libp2p and hit this bug again.