Open richardlau opened 2 years ago
I will take care of that tomorrow morning.
Out of the box I just have to fine tune timers.
@nodejs/build Can I get access to that BSD machine?
Looks like #42846 reduced the error rate by some amount but it's still flaky. We can probably further increase the base timeout and see if that helps, even if the test will become unbearably slow.
I have a further change in #42812 which should reduce it even more. I'd suggest to wait for that to land and see what happens.
Anyway given the success rate is above 99% I think we might think about closing this (since retries in CI are normal) and reopen only if there is consistent failure.
I have a further change in #42812 which should reduce it even more. I'd suggest to wait for that to land and see what happens.
Anyway given the success rate is above 99% I think we might think about closing this (since retries in CI are normal) and reopen only if there is consistent failure.
This issue is tracking that the test is flaky, which it still appears to be so I'm reopening.
It's interesting that it seems to only happen on macOS and FreeBSD. I might run a few stress tests.
Patch (assuming the issue is timing-related):
diff --git a/test/parallel/test-http-server-headers-timeout-keepalive.js b/test/parallel/test-http-server-headers-timeout-keepalive.js
index 05531087ed83..ba82cb0e0c9c 100644
--- a/test/parallel/test-http-server-headers-timeout-keepalive.js
+++ b/test/parallel/test-http-server-headers-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
}, firstDelay + secondDelay).unref();
}
-const headersTimeout = common.platformTimeout(2000);
+const headersTimeout = common.platformTimeout(5000);
const server = createServer({
headersTimeout,
requestTimeout: 0,
diff --git a/test/parallel/test-http-server-request-timeout-keepalive.js b/test/parallel/test-http-server-request-timeout-keepalive.js
index 2466e1ee7a95..eb6d64b2f2bb 100644
--- a/test/parallel/test-http-server-request-timeout-keepalive.js
+++ b/test/parallel/test-http-server-request-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
}, firstDelay + secondDelay).unref();
}
-const requestTimeout = common.platformTimeout(2000);
+const requestTimeout = common.platformTimeout(5000);
const server = createServer({
headersTimeout: 0,
requestTimeout,
First run (master and with patch):
osx11 | freebsd12-x64 | |
---|---|---|
master | OK: 902 NOT OK: 98 TOTAL: 1000 |
OK: 863 NOT OK: 137 TOTAL: 1000 |
increased timeout | OK: 676 NOT OK: 324 TOTAL: 1000 |
OK: 1000 NOT OK: 0 TOTAL: 1000 |
Second run (master and with patch):
osx11 | freebsd12-x64 | |
---|---|---|
master | OK: 910 NOT OK: 90 TOTAL: 1000 |
OK: 819 NOT OK: 181 TOTAL: 1000 |
increased timeout | OK: 713 NOT OK: 287 TOTAL: 1000 |
OK: 1000 NOT OK: 0 TOTAL: 1000 |
Unless I'm mistaken, increasing the timeout fixes the issues on FreeBSD but somehow makes it worse on macOS?
I started a third run with fresh revisions because I find this hard to believe (master and with patch).
Update: Third run (master and with patch):
osx11 | freebsd12-x64 | |
---|---|---|
master | OK: 916 NOT OK: 84 TOTAL: 1000 |
OK: 838 NOT OK: 162 TOTAL: 1000 |
increased timeout | OK: 721 NOT OK: 279 TOTAL: 1000 |
OK: 1000 NOT OK: 0 TOTAL: 1000 |
Median success rate (refer to patch and separate tables above):
osx11 | freebsd12-x64 | |
---|---|---|
master (2000ms) | 91.0 % | 83.8 % |
patched (5000ms) | 71.3 % | 100 % |
@richardlau Thanks, I think one commit closed this by mistake.
@tniessen This is very strange. I code on MacOS and never got a failure...
@richardlau Based on https://github.com/nodejs/node/issues/42741#issuecomment-1110926314, would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS? Do GitHub actions respect our list of flaky tests?
@richardlau Based on #42741 (comment), would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS?
I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.
Do GitHub actions respect our list of flaky tests?
They should do: https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/.github/workflows/test-macos.yml#L34 https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/Makefile#L526-L530
I've no idea if it's the same underlying cause, but I've just spotted on LinuxONE: https://ci.nodejs.org/job/node-test-commit-linuxone/31851/nodes=rhel8-s390x/consoleFull
12:30:49 not ok 1273 parallel/test-http-server-request-timeout-keepalive
12:30:49 ---
12:30:49 duration_ms: 6.565
12:30:49 severity: fail
12:30:49 exitcode: 1
12:30:49 stack: |-
12:30:49 Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
12:30:49 at Proxy.mustCall (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:396:10)
12:30:49 at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-http-server-request-timeout-keepalive.js:77:27)
12:30:49 at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:438:15)
12:30:49 at Object.onceWrapper (node:events:641:28)
12:30:49 at Server.emit (node:events:527:28)
12:30:49 at emitListeningNT (node:net:1414:10)
12:30:49 at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
12:30:49 ...
I think this might have been an error from me, I have:
client.on('error', errOrEnd);
client.on('end', errOrEnd);
So being called two times is legit if things are not closed fast enough,
Should I mark this test and all tests in https://github.com/nodejs/node/pull/42893 (without raising the timeout)? They all rely on timers so I think we're gonna hard time guaranteeing they don't fail on slow platforms...
(just to clarify: they pass on most platforms so I don't think we're ignoring a bug)
I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.
@richardlau I opened #42962 to mark the two known problematic tests as flaky on macOS. Increasing the timeout seems to fix the FreeBSD failures.
They all rely on timers so I think we're gonna hard time guaranteeing they don't fail on slow platforms...
The strange part is that the error rate seems to increase when the timeout increases (see https://github.com/nodejs/node/issues/42741#issuecomment-1110926314), so I am not sure if macOS being slow is the issue here.
@tniessen You are right. I actually spotted a bug while working on #42812, see changes in https://github.com/nodejs/node/pull/42812/files#diff-bbcf7a115fe4680f3a7c076a487c89bd4ce935e7d2a726cfcb3f69331705e25eR260.
~Do you mind trying again first with and 2 and then 5 seconds once that lands?~
Both related PR are now in master. Can you please try again with 2 and/or 5 seconds?
@tniessen Any update on this? Are tests improving?
It seems the flake is still there: https://github.com/nodejs/reliability/issues/273
I just comprehended what had happened here.
The issue was mistakenly closed by cross-connection via pushing in another repository a commit having Fixes:
reference.
The push contained up to 567 latest commits (starting from 2022-04-01) from node
repository, rebased on top of test commits.
Checking with curl https://api.github.com/users/LiviaMedeiros/events?per_page=100 | jq '.[]|select(.type=="IssuesEvent")'
confirmed that only this issue was affected.
I deeply apologize for this embarassing incident. In the future, I'll keep anything like that on a separate GitHub account. Please let me know if there are other side effects or something to undo.
@aduh95 thanks for reopening this.
Pinging you as a TSC member (if I made a mistake
is applicable).
I got a failure in https://github.com/nodejs/node/pull/44741#issuecomment-1260454905
Getting some failures again on osx11 ([test-http-server-request-timeouts-mixed](https://ci.nodejs.org/job/node-test-pull-request/59813/, https://ci.nodejs.org/job/node-test-pull-request/59820/)
Test
test-http-server-request-timeout-keepalive
Platform
FreeBSD, macOS
Console output
Build links
Additional information
Test was recently updated by https://github.com/nodejs/node/pull/41263.
cc @ShogunPanda