knadh / niltalk

Instant, disposable, single-binary web based live chat server. Go + VueJS.
GNU Affero General Public License v3.0
952 stars 117 forks source link

Incorrect room destroying #9

Closed inmylo closed 7 years ago

inmylo commented 7 years ago

Hi,

I'm trying to use your multiroom chat example and I've found a problem with room destroying. Currently niltalk works like this:

  1. room is created;
  2. peers are connected;
  3. communication;
  4. stop signal to close the room;
  5. peers are disconnected;
  6. room is destroyed.

But in the Peer.listen() function you have a defer function:

defer func() {
        if _, exists := rooms[peer.room_id]; exists {
            rooms[peer.room_id].unregister <- peer
        }
        peer.ws.Close()
    }()

I often (and randomly) get a situation when this defer function is executed after the room is destroyed - so unregister channel is already closed and I get a runtime error write to the closed channel, sometimes my program even crashes after this error.

Please help to solve this. What would be the most optimal solution?

I've tried to add a room peer's counter and close when counter == 0, but counter checking in a loop freezes the process, so the solution is not working. Adding a timeout before closing the room is probably the worst solution.

knadh commented 7 years ago

Seems like a race condition although I've never seen it happen. Are you able to pin point when this happens?

if _, exists := rooms[peer.room_id]; exists { If a room is already destroyed, this bit in the defer() should never evaluate to true, and thus, not try and delete the room again.

inmylo commented 7 years ago

I think I have an idea why it happens. You can try to add to your code just 1 line - in your func (room *Room) dispose(status int) add a timeout before closing channels:

...
time.Sleep(5*time.Second)

// Close all the room channels.
close(room.broadcastQueue)
close(room.register)
...

..then go to your app, create a room, enter it, press dispose button.

After 5 seconds you will get an error:

DEBUG: 2016/11/18 22:51:39 Starting room gnVRQ
DEBUG: 2016/11/18 22:51:41 User1@eWIYue connected

DEBUG: 2016/11/18 22:51:53 Stopped room gnVRQ
2016/11/18 22:51:58 http: panic serving 127.0.0.1:59756: send on closed channel
goroutine 41 [running]:
net/http.(*conn).serve.func1(0xc420162a00)
    /path1/go/src/net/http/server.go:1491 +0x12a
panic(0x7238c0, 0xc4201d6150)
    /path1/go/src/runtime/panic.go:458 +0x243
main.(*Peer).listen.func1(0xc420196a10)
    /path2/src/github.com/user/niltalk/conn.go:277 +0xe0
main.(*Peer).listen(0xc420196a10)
...
...
...

In my code instead of this timeout there are another functions inside dispose(), so they are making a delay and I get this error.

I've tried to move my code before deleting all peers, before your for peer := range room.peers { loop, but I still sometimes get this error. Probably it depends on how fast different "threads" are being executed.

As I understand this is what's happening: 1) room is created; 2) peer is connected; 3) room.stop <- 0 signal to stop the room, so room.run() loop stops -> noone listens on the unregister channel any more, but room still exists; 4) peer is disconnected -> his listen.defer func is being executed and because room still exists -

if _, exists := rooms[peer.room_id]; exists {

is True and then:

rooms[peer.room_id].unregister <- peer

... here process freezes because noone listens that channel; 5) all room's channels are deleted and room itself; 6) you get an error about closed channel.

inmylo commented 7 years ago

What about clearing unregister channel before closing it?

...
// Clear channel to make sure all peers can disconnect
for _ = range room.unregister {}
close(room.unregister)
...

Looks like it works for me

knadh commented 7 years ago

You are right. I was able to reproduce it with the time.sleep() hack. https://github.com/knadh/niltalk/commit/10c6b177c2876b198c8bcf69be1189f471c59775 fixes the issue. When dispose() blocks (for example, with the time.Sleep) and peers are disconnecting in the meanwhile, they should not block trying to write to the unregister channel. The select-default in the new commit addresses this.