nodejs / node-convergence-archive

Archive for node/io.js convergence work pre-3.0.0
https://github.com/nodejs/io.js/issues/2327
Other
1.84k stars 96 forks source link

[Converge] test: make cluster tests more time tolerant #38

Closed jasnell closed 9 years ago

jasnell commented 9 years ago

See: https://github.com/joyent/node/commit/f3f4e282168d243610f8e0241d633ff941c9c260

/cc @mhdawson

Original commit message:

simple tests test-cluster-master-error.js, test-cluster-master-kill.js
fails in AIX with assertion failure indicating that the workers were
alive even after the master terminated. A 200ms leeway is provided for
the workers to actually terminate, but the isAlive check returns
true in both the cases.

In AIX, the workers were actually terminating, but they took more time
- as much as 800ms (normal) to 1000ms (in rare cases).

Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective.

process.kill(pid) is used in the test case to check the liveliness of
the worker, and when the kill() call is issued, even if the target
process is in it's exit sequences, if the signal handlers are not shut
down, it will respond to external signals, causing those calls to pass.

This fix extends the additional timeout for all platforms
Fishrock123 commented 9 years ago

I think we now have better ways of specifying platform-specific timeouts. cc @silverwind

silverwind commented 9 years ago

Yes, there's two ways to achieve different timeouts per platform. First for individual timeout values inside tests with common.platformTimeout:

https://github.com/nodejs/io.js/blob/39dde3222e4733fc1b59c45e392d9ff1a88ae4cc/test/common.js#L183-L191

And for the overall test timeout (30s):

https://github.com/nodejs/io.js/blob/39dde3222e4733fc1b59c45e392d9ff1a88ae4cc/tools/test.py#L731-L734

jasnell commented 9 years ago

@mhdawson ... can you take a look to see if the io.js changes mentioned by @silverwind address the original issue. If they do, then we can likely close this.

Fishrock123 commented 9 years ago

@jasnell @mhdawson I checked and they don't seem to, however it would be better to use platform-specific timeouts that we have for these kinds of things than just raising the timeout.

Fishrock123 commented 9 years ago

Hmmm, this seems to be a special-case for AIX's exit() though, so I'm not sure this can be applied as a common thing?

Thoughts? We could just take the patch, but it also means the test will take over a second on all machines. :/

mhdawson commented 9 years ago

I agree that we don't necessarily just want to up the general timeout for AIX.

If I remember correctly I think we had started with a special case in the code for AIX (either for this PR or some others related to timeouts), but in the review it was suggested we just apply to all platforms to avoid the extra platform checks.

Fishrock123 commented 9 years ago

@mhdawson your commit log above does have some detailed info on it, yes.

In io.js we've preferred to checking the platform it seems.

rvagg commented 9 years ago

@mhdawson can I leave this for you to decide whether it needs to go in to nodejs/node prior to 4.0? I don't think it's that important cause it's only a test case but I'll leave it to you.

jasnell commented 9 years ago

just fyi, @mhdawson is out on vacation for this week. I agree with your assessment tho, so long as the current CI is passing, this shouldn't be a priority for 4.0

On Mon, Aug 24, 2015 at 12:21 AM, Rod Vagg notifications@github.com wrote:

@mhdawson https://github.com/mhdawson can I leave this for you to decide whether it needs to go in to nodejs/node prior to 4.0? I don't think it's that important cause it's only a test case but I'll leave it to you.

— Reply to this email directly or view it on GitHub https://github.com/nodejs/node-convergence-archive/issues/38#issuecomment-134070723 .

mhdawson commented 9 years ago

I'd like to be able to get it into the LTS stream, but provided I'll be able to do that later on I don't think its a blocker for 4.0. Based on the the discussion it sounds like I should just create a PR which makes the timeout longer only for AIX, right ?

Fishrock123 commented 9 years ago

@mhdawson Yes. I'd probably suggest adding AIX to common.platformTimeout() and then using that everywhere. (Unless this is super special case and AIX doesn't need it anywhere else, and nothing else needs it here.)

mhdawson commented 9 years ago

@Fishrock123 We have extended timeouts but not to the extent that I'd want to make all of the timeouts longer for AIX lat this point. I'd prefer to just as an AIX check to this test. If we end up having to change a significant number then I'd plan for us to change common.platformTimeout() as you suggest. Does this sound reasonable to you ?

Fishrock123 commented 9 years ago

@mhdawson Yep, that sounds fine. :)

mhdawson commented 9 years ago

@Fishrock123, @jasnell PR created on master: https://github.com/nodejs/node/pull/2891 could one of you review for me.

mhdawson commented 9 years ago

Landed on master closing this issue https://github.com/nodejs/node/commit/2853f9894fcecef5979d7ec2618c79760532253c