philippseith / signalr

SignalR server and client in go
MIT License

Possible logic error in Stop() #189

Closed sruehl closed 7 months ago

sruehl commented 9 months ago

Looking at the Stop method in client.go:220, I was wondering if the logic is wrong. At the moment, the state is set immediately:

func (c *client) Stop() {
    if c.cancelFunc != nil {
        c.cancelFunc()
    }
    c.setState(ClientClosed)
}

which should rather be

func (c *client) Stop() {
    if c.cancelFunc != nil {
        c.cancelFunc()
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) // in practice it is faster than 5 seconds, so this is just to avoid an infinite block
        defer cancel()
        c.WaitForState(ctx, ClientClosed)
    }
}

Rationale: When calling Stop you really want the client to stop. Stopped means that no more goroutines are running.

Effect: At the moment, when you redirect the logger to testing.T's Log, you get data races when run with -race:

type testLogWrapper struct {
    *testing.T
}

func (t testLogWrapper) Log(args ...any) error {
    t.T.Helper()
    t.T.Log(args...)
    return nil
}
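
For reference, this is roughly how such a wrapper gets wired in; the connection setup and address are placeholders, and the exact NewHTTPConnection/WithConnection/Logger calls are my assumption about the library's client options, so treat this as an illustrative sketch:

func TestStopRace(t *testing.T) {
    ctx := context.Background()
    conn, err := signalr.NewHTTPConnection(ctx, "http://localhost:8080/hub") // placeholder address
    if err != nil {
        t.Fatal(err)
    }
    client, err := signalr.NewClient(ctx,
        signalr.WithConnection(conn),
        signalr.Logger(testLogWrapper{t}, true)) // route all library logging through t.Log
    if err != nil {
        t.Fatal(err)
    }
    client.Start()
    client.Stop() // goroutines that keep logging past the end of the test trip the race detector
}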

I created a PR to show the issue with a proper test case: https://github.com/philippseith/signalr/pull/188. (Even with the updated code I run into a timeout.)

philippseith commented 8 months ago

Effect: At the moment, when you redirect the logger to testing.T's Log, you get data races when run with -race

IMHO this could be a problem of testing.T.Log, not of the client

Rationale: When calling Stop you really want the client to stop. Stopped means that no more goroutines are running

  1. Did you observe that the client didn't stop?
  2. After calling Stop, the Client state is not Stopped (there is no such state), but ClientClosed, which is defined as:
    //  ClientClosed
    //
    // The Client is not able to send and receive messages anymore and has to be started again to be able to.

    Did you observe that this was not the case after calling Stop?

sruehl commented 8 months ago

IMHO this could be a problem of testing.T.Log, not of the client

Yes, it is a problem of T, but it is also a symptom of goroutines still running.

  1. Did you observe that the client didn't stop?

Yes, by having goroutines still logging.

Did you observe that this was not the case after calling Stop?

Yes, because the client just sets the state to ClientClosed when calling Stop.

Therefore I created #188 to fix the problem. In the original version the test works, but with the fix attempt the test fails there (so the test itself is suitable to be used upstream).

philippseith commented 8 months ago

Stop() can't wait for all goroutines to end, because the goroutines involved in the client are not built that way. They all run on a select with a case <-ctx.Done(): when they are idle (which they inevitably become). So they will end after the runtime has checked all the select statements. But this does not happen in zero time, especially when the goroutine where Stop() was called is still running.
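
A simplified sketch of the pattern described above (the method, channel, and handler names are hypothetical, not the actual client code):

func (c *client) loop(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            // Only reached when the scheduler next runs this goroutine,
            // so cancellation never takes effect in zero time.
            return
        case msg := <-c.incoming: // hypothetical message channel
            c.handle(msg)
        }
    }
}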

It looks like #188 just delays the return of the function. It ends when the WaitForState returns. But the WaitForState mechanism involves goroutines and channels, so the necessary context switches can take place and the goroutines end.

I don't see any benefit in making Stop() synchronous. If it is called, the client refrains from executing any user requests, and all resources used by it are freed after a time. This is all you can expect in a garbage-collected runtime like Go.

It would be possible to add a sync.WaitGroup to all started goroutines to make it safely synchronous, but as I said, I don't see the benefit over a simple <-time.After(50*time.Microsecond) after calling Stop()
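
For comparison, a minimal sketch of what the WaitGroup variant under discussion could look like (the spawn helper is made up for illustration; it is not the library's actual code):

type client struct {
    cancelFunc context.CancelFunc
    wg         sync.WaitGroup // counts every goroutine the client spawns
    // ...
}

// spawn is a hypothetical helper that would wrap every `go` call in the client.
func (c *client) spawn(f func()) {
    c.wg.Add(1)
    go func() {
        defer c.wg.Done()
        f()
    }()
}

func (c *client) Stop() {
    if c.cancelFunc != nil {
        c.cancelFunc()
    }
    c.wg.Wait() // returns only after every spawned goroutine has ended
    c.setState(ClientClosed)
}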

sruehl commented 8 months ago

Stop() can't wait for all goroutines to end, because the goroutines involved in the client are not built that way. They all run on a select with a case <-ctx.Done(): when they are idle (which they inevitably become). So they will end after the runtime has checked all the select statements. But this does not happen in zero time, especially when the goroutine where Stop() was called is still running.

Yes, and it would be good if we had a sync.WaitGroup like you describe below.

It looks like #188 just delays the return of the function. It ends when the WaitForState returns. But the WaitForState mechanism involves goroutines and channels, so the necessary context switches can take place and the goroutines end.

I thought so; hence the PR is still a WIP, as we indeed need a sync.WaitGroup to check that all spawned goroutines are gone.

I don't see any benefit in making Stop() synchronous. If it is called, the client refrains from executing any user requests, and all resources used by it are freed after a time. This is all you can expect in a garbage-collected runtime like Go.

One benefit would be that no goroutines are running anymore after you call Stop(). In certain use cases this can leak goroutines, and the simpler case is that *testing.T panics when something writes to t.Log after the test is done. This makes sense and is the case when you redirect logs as I wrote above.

It would be possible to add a sync.WaitGroup to all started goroutines to make it safely synchronous, but as I said, I don't see the benefit over a simple <-time.After(50*time.Microsecond) after calling Stop()

I listed some benefits above (no leaked goroutines, the *testing.T log redirect works). I consider it bad to work with sleeps of any kind, since under different machines and load scenarios it is not deterministic whether shutdown takes more or less than 50ms, for example. While it is doable to just sleep for 5 seconds after calling Stop() or cancelling the context, I don't consider that a good way to check whether something has properly stopped.

philippseith commented 8 months ago

One benefit would be that no goroutines are running anymore after you call Stop(). In certain use cases this can leak goroutines

There are no leaking goroutines. They all come to an end when the Go runtime scheduler checks them the next time, as they all wait on a select statement when no SignalR protocol work has to be done. But the order in which the scheduler does this, and therefore the time when they have all ended, cannot be determined exactly.

What remains is the panic or race in testing. If you look at the code in testing.go, you will see that both reactions are in there to help detect goroutines that are not appropriately synchronized. But for signalr this is a false positive.

If you like, you can try to implement the sync.WaitGroup idea. But keep in mind that client and server share some code.

sruehl commented 8 months ago

I will add the sync.WaitGroup in the upcoming days

sruehl commented 8 months ago

I updated #188 and added WaitGroup additions sparingly, where it made direct sense. For other goroutines that were spawned as a direct result of a client action (the client should not really trigger those after Stop), I left them out. There were some places where I was unsure. I added a test to check that at least we don't get some log messages (it might be incomplete, as it doesn't test the full lifecycle).
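
A rough sketch of that kind of check (simplified and with hypothetical names, not the actual test code): a wrapper that fails the test if anything logs after the client was stopped.

type strictLogWrapper struct {
    t      *testing.T
    mu     sync.Mutex
    closed bool
}

// Close marks the point after which no more log calls are allowed.
func (l *strictLogWrapper) Close() {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.closed = true
}

func (l *strictLogWrapper) Log(args ...any) error {
    l.mu.Lock()
    defer l.mu.Unlock()
    if l.closed {
        l.t.Errorf("log call after Stop: %v", args)
        return nil
    }
    l.t.Log(args...)
    return nil
}

In the test, the client would be started with this logger, stopped, and then Close() called; with the WaitGroup in place, Stop() only returns after all goroutines have ended, so no late Log call should ever fire.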