riverqueue / river

Fast and reliable background jobs in Go
https://riverqueue.com
Mozilla Public License 2.0
3.51k stars 94 forks source link

Remove component status infrastructure in favor of reusing start/stop #436

Closed brandur closed 3 months ago

brandur commented 3 months ago

We've made quite a few changes in recent months around winding stop/start into everything to the point where all services are now compliant with this concept, including the main client.

Quite a long time ago we put in a "component monitor" framework to help tests wait for clients to successfully start up. It worked reasonably well, but is a large amount of code, not very generic, and involves winding callback functions into all components, which is a little unsightly.

Here, I'm proposing that we replace the job component monitor was doing with a more simple wait on the Started channel of the client:

startClient(ctx, t, client)
riversharedtest.WaitOrTimeout(t, client.baseStartStop.Started())

This is made possible because similar to how the component monitor framework worked, the client doesn't signal that it's fully started until it's successfully waited for all its subservices to start up:

func (c *Client[TTx]) Start(ctx context.Context) error {
    ...

    go func() {
        // Wait for all subservices to start up before signaling our own start.
        // This isn't strictly needed, but gives tests a way to fully confirm
        // that all goroutines for subservices are spun up before continuing.
        //
        // Stop also cancels the "started" channel, so in case of a context
        // cancellation, this statement will fall through. The client will
        // briefly start, but then immediately stop again.
        startstop.WaitAllStarted(append(
            c.services,
            producerServices..., // see comment on this variable
        )...)

        started()
        defer stopped()

So all in all, we should be able to get similar guarantees for test purposes while needing less code and getting more reusability (in that non-client services can also use exactly the same code to wait for start up).

brandur commented 3 months ago

@bgentry I ran the test matrix 12x times for a total of 72 runs with zero failures. Pretty sure this works — what do you think?

bgentry commented 3 months ago

The one thing that concerns me about this removal is that it feels like we're losing insight into what we're still waiting on, or which components may have failed and been restarted. For example, at any given moment, which components are healthy and which ones are having errors? Is there some kind of debug snapshot that could be printed or serialized with this info for all internal components?

I think we can probably find a way to work that into what we're replacing this with so long as we're mindful of it.

brandur commented 3 months ago

Awesome. Yeah I've got some ideas for that for sure — the first step will be to have services start to identify themselves so we can produce better error messages in case something failed to start or stop. We can go from there. Thanks!