Closed brandur closed 3 months ago
@bgentry I ran the test matrix 12x times for a total of 72 runs with zero failures. Pretty sure this works — what do you think?
The one thing that concerns me about this removal is that it feels like we're losing insight into what we're still waiting on, or which components may have failed and been restarted. For example, at any given moment, which components are healthy and which ones are having errors? Is there some kind of debug snapshot that could be printed or serialized with this info for all internal components?
I think we can probably find a way to work that into what we're replacing this with so long as we're mindful of it.
Awesome. Yeah I've got some ideas for that for sure — the first step will be to have services start to identify themselves so we can produce better error messages in case something failed to start or stop. We can go from there. Thanks!
We've made quite a few changes in recent months around winding stop/start into everything to the point where all services are now compliant with this concept, including the main client.
Quite a long time ago we put in a "component monitor" framework to help tests wait for clients to successfully start up. It worked reasonably well, but is a large amount of code, not very generic, and involves winding callback functions into all components, which is a little unsightly.
Here, I'm proposing that we replace the job component monitor was doing with a more simple wait on the
Started
channel of the client:This is made possible because similar to how the component monitor framework worked, the client doesn't signal that it's fully started until it's successfully waited for all its subservices to start up:
So all in all, we should be able to get similar guarantees for test purposes while needing less code and getting more reusability (in that non-client services can also use exactly the same code to wait for start up).