Closed CthulhuDen closed 2 months ago
This is excellent, thank you for the PR! I'm at a conference in Dublin until Friday, but I've taken a look and think I can see the error for handling of transient children.
For permanent children, the issue is more nuanced I think. We should expect that the max restart intensity will be reached quite quickly, since the child exits immediately. I suspect looking at the code however, that we only handle that case when a child dies during normal running - we behave differently (and incorrectly!) while the supervisor itself is starting up and spinning up its children. This is a tricky case, since nested supervisors need to be accounted for, but I agree we need to fix it.
I'll comment back once I've looked in more detail.
Question: what happens if you verify in both cases that the server goes down - which is what should happen now in both cases, because the max restart intensity should be reached quite quickly and bring the server down... That would be the correct behaviour in this situation as per the semantics/docs and is what an OTP supervisor would do.
I think the tests will still fail, because I think the code that sets up our monitors when we first spawn the children contains a race, at least for the starter process and handle paths of ToChildStart, but that's a different issue.
Not sure about the second test case, I think we might need a different verification step for that since the exit will be normal so we should verify that the child didn't restart instead or checking that it did (which is only the case for a transient that exits abnormally, which this isn't doing)
Note to self: we need to add a test case for that, to make sure we're removing the child even if it exits very fast on init
Question: what happens if you verify in both cases that the server goes down
Note that this is what happens with the permanentChildExceedsRestartsIntensity case already, so I think that case is covered. I think we should parameterise that case and inject all the child restart types that can trigger it and verify them independently, leaving the only test case here to be the original one you reported - verifying that we correctly handle a transient child that exits fast but exits normally, by removing it's child spec.
OK on investigation - bare in mind I'm looking at this on my smartphone though - I think there is a bug, but not the one we've been discussing. If you look at the test cases, they use the default supervisor definition, which is defined by restartOne, which in turn is defined in terms of:
restartOne = RestartOne defaultLimits
defaultLimits = limit (MaxR 1) (seconds 1)
So as soon as you exceed 1 restart within 1 seconds, the server will exit with MaxRestartIntensityReached
exit reason. That's expected behaviour. Please change the definition you're using in the test cases to allow for more restarts per timeout check, and set up the child process to eventually stop restarting and verify it does what you'd expect.
I do see some races for monitor setup during initialisation, which I'll add some test cases for and attempt to fix shortly. I think those warrant a separate bug report though. Also we have a reprise of a bug reported before the repos were split that starter pids can leak, and actually the ToChildStart instance for Process()
is fundamentally broken and I'll file a ticket to remove or replace it.
And do please shout if you disagree or can see something I've missed here - as I said I'm peering at the code on my phone so I could be talking utter rubbish! :)
Oh and in the transient child case, when we're not exceeding the restart limits, we should test that if the child dies during init we still remove it. Actually I think we should file a bug to to.add further tests for all the restart types validating what they do if the child crashes during init
Also see #8
I've been busy fixing distributed-process-client-server, but will be getting on to this shortly.
I'm working on reproducing this atm. Currently there are a few intermittently daily test cases in master before merging this patch, so I'm going to resolve those first.
This PR, along with others, went cold. My apologies for that. I've not been maintaining this library for a year or so, and whilst I want to get back to it, but I haven't been able to due to ill health this last year.
I believe this problem has gone away now that we've removed the ToChildStart
instance for Process ()
. If anyone can confirm then we can close this PR. If it's still an issue, I'll try and get access to a development environment and take a look.
Another update - I'm actively working on this again now. I will look to merge those test cases and see if we still have an issue.
This repository has moved. Since it looks like the issue was resolved, I will close this.
This PR is demonstration for what I believe to be an issue.
The supervisor appears to misbehave when worker exits very quickly (it only restarts once as failing cases now demonstrate, and in fact it restarts once even transient workers which exited normally though I couldn't demonstrate that clearly with tests).