lxzan / gws

simple, fast, reliable websocket server & client, supports running over tcp/kcp/unix domain socket. keywords: ws, proxy, chat, go, golang...
https://pkg.go.dev/github.com/lxzan/gws
Apache License 2.0
1.34k stars 84 forks source link

Nil pointer dereference crash #89

Closed goriunov closed 3 months ago

goriunov commented 3 months ago

Looks like there is nil pointer dereference, our servers have been running stably for the last week and suddenly today we got this error (running on latest version of the code set), not yet sure what the full trigger conditions to reproduce:

[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x105100c9c]

goroutine 30724 [running]:
github.com/lxzan/gws.(*Conn).emitError(0x0, {0x1057966e0, 0xc0282fc558})
    /Users/ss/go/pkg/mod/github.com/lxzan/gws/conn.go:113 +0x3c
github.com/lxzan/gws.(*Conn).WriteClose(0x0, 0x3e8, {0xc003d6ffaf, 0x0, 0x0?})
    /Users/ss/go/pkg/mod/github.com/lxzan/gws/writer.go:23 +0x178
github.com/internal/utils.(*WebsocketClient).Shutdown(...)
...
exit status 2

it looks like the emitError has a nil pointer dereference but not yet sure how this got to this, our code simply calls:

wc.conn.WriteClose(1000, []byte{})
lxzan commented 3 months ago

Are you using gws v1.8.3 ?

Do you give complete information about the error stack? I don't see panic

The server configuration requires you to provide

goriunov commented 3 months ago

Yes we are on v1.8.3 sorry missed the first line of the log it looks as follow:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x105100c9c]

goroutine 30724 [running]:
github.com/lxzan/gws.(*Conn).emitError(0x0, {0x1057966e0, 0xc0282fc558})
    /Users/ss/go/pkg/mod/github.com/lxzan/gws/conn.go:113 +0x3c
github.com/lxzan/gws.(*Conn).WriteClose(0x0, 0x3e8, {0xc003d6ffaf, 0x0, 0x0?})
    /Users/ss/go/pkg/mod/github.com/lxzan/gws/writer.go:23 +0x178
github.com/internal/utils.(*WebsocketClient).Shutdown(...)
...
exit status 2

The rest of the stack is purely our internal code before we call the shutdown internal function in our wrapper, the panic actually happened in the c.emitError so I assumed the internal references would not help without really knowing the internal code and it does not really contain anything that would cause the panic (unfortunate that code is also private so i am not able to share it)

We have purely client connection (not a server) here is the configurations we use for the client connection, unfortunately can not provide an actual address:

&gws.ClientOption{
        Addr: ""
        PermessageDeflate: gws.PermessageDeflate{
            Enabled:               true,
            ServerContextTakeover: true,
            ClientContextTakeover: true,
        },
        ReadBufferSize:  128 * 1024,
        WriteBufferSize: 128 * 1024,
}

The issue happened only once today after server restarted by itself everything is back to normal. I don't think it is related to the internal part of the code as it definitely can reach the c.emitError() function from the gws lib.

lxzan commented 3 months ago

github.com/lxzan/gws.(*Conn).emitError(0x0, {0x1057966e0, 0xc0282fc558})

The first parameter is 0x0 which means Conn has become nil, did you assign *gws.Conn to null pointer?

goriunov commented 3 months ago

No i do not have the nill assignment any where for existing connections, also that should not be possible to do that as calling:

WriteClose() is a sync function, and it must be called on the Conn reference. The only way it could have been the case, if somehow the nill was assigned in between calling the WriteClose and calling the emitError under the WriteClose (which does not seem right)

goriunov commented 3 months ago

Actually if we look at the WriteClose it already had a nil pointer as a first parameter, that is very interesting how was that function even trigger on the nil pointer (🤔 ). If that was actually a nil it should have shouted at the WriteClose call

lxzan commented 3 months ago

Your question is very strange, normally a null pointer exception would look like this

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x90 pc=0x63d23d]

goroutine 1 [running]:
github.com/lxzan/gws.(*Conn).emitError(0x7f08a53757c8?, {0x730100?, 0xc0000aa0d8?})
        C:/msys64/home/lixizan/Open/gws/conn.go:111 +0x3d
github.com/lxzan/gws.(*Conn).WriteClose(0x0, 0x3e8, {0xc0000a6f27, 0x0, 0x0?})
        C:/msys64/home/lixizan/Open/gws/writer.go:23 +0x111
main.main()
        C:/msys64/home/lixizan/Open/gws/examples/echo/main.go:12 +0x28
lxzan commented 3 months ago

We need to take the time to study the reproduction mechanism

goriunov commented 3 months ago

After digging through code i found one case where the conn could be nil if the connection was not properly established, it is very interesting to me though that i am able to call the WriteClose on the nil pointer, and then it falls down to emitError that crashes (was expecting it to be crashing at the place where)...

Learning every day :)

goriunov commented 3 months ago

For future readers, the error is possible if you have:

type WebsocketClient struct {
    conn             *gws.Conn
}

// at this point the conn == nil
v := WebsocketClient{}

// the function will be triggered and stack trace with crash will point to the c.emitError
// and not to this line directly which confused the hell out of me
v.conn.WriteClose(1000, []byte{})
lxzan commented 3 months ago
receiver.do(args) <=> do(receiver, args)

Not panic without accessing the receiver's attributes

lxzan commented 3 months ago

For future readers, the error is possible if you have:

type WebsocketClient struct {
  conn             *gws.Conn
}

// at this point the conn == nil
v := WebsocketClient{}

// the function will be triggered and stack trace with crash will point to the c.emitError
// and not to this line directly which confused the hell out of me
v.conn.WriteClose(1000, []byte{})

I'm confused as to why the error doesn't point to line 111.

lxzan commented 3 months ago

*gws.Conn is a null pointer causing panic, that means it's not a bug in gws, you should check the business code first

goriunov commented 3 months ago

Yep as mentions in the previous message i have found an issue in our side, not the gws lib. So all good for now!