nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
105.23k stars 28.5k forks source link

Investigate flaky test-http-server-request-timeout-keepalive #42741

Open richardlau opened 2 years ago

richardlau commented 2 years ago

Test

test-http-server-request-timeout-keepalive

Platform

FreeBSD, macOS

Console output

17:13:37 not ok 1424 parallel/test-http-server-request-timeout-keepalive
17:13:37   ---
17:13:37   duration_ms: 3.487
17:13:37   severity: fail
17:13:37   exitcode: 1
17:13:37   stack: |-
17:13:37     node:assert:123
17:13:37       throw new AssertionError(obj);
17:13:37       ^
17:13:37     
17:13:37     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
17:13:37     + actual - expected
17:13:37     
17:13:37     + 'HTTP/1.1 200 OK\r\n' +
17:13:37     +   'Content-Type: text/plain\r\n' +
17:13:38     +   'Date: Thu, 14 Apr 2022 16:13:37 GMT\r\n' +
17:13:38     +   'Connection: close\r\n' +
17:13:38     +   'Transfer-Encoding: chunked\r\n' +
17:13:38     +   '\r\n' +
17:13:38     +   '0\r\n' +
17:13:38     +   '\r\n'
17:13:38     - ''
17:13:38         at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd12-x64/test/parallel/test-http-server-request-timeout-keepalive.js:82:12)
17:13:38         at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd12-x64/test/common/index.js:438:15)
17:13:38         at Socket.emit (node:events:539:35)
17:13:38         at endReadableNT (node:internal/streams/readable:1344:12)
17:13:38         at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
17:13:38       generatedMessage: true,
17:13:38       code: 'ERR_ASSERTION',
17:13:38       actual: 'HTTP/1.1 200 OK\r\n' +
17:13:38         'Content-Type: text/plain\r\n' +
17:13:38         'Date: Thu, 14 Apr 2022 16:13:37 GMT\r\n' +
17:13:38         'Connection: close\r\n' +
17:13:38         'Transfer-Encoding: chunked\r\n' +
17:13:38         '\r\n' +
17:13:38         '0\r\n' +
17:13:38         '\r\n',
17:13:38       expected: '',
17:13:38       operator: 'strictEqual'
17:13:38     }
17:13:38     
17:13:38     Node.js v18.0.0-pre
17:13:38   ...

Build links

Additional information

Test was recently updated by https://github.com/nodejs/node/pull/41263.

cc @ShogunPanda

ShogunPanda commented 2 years ago

I will take care of that tomorrow morning.

Out of the box I just have to fine tune timers.

@nodejs/build Can I get access to that BSD machine?

tniessen commented 2 years ago

Looks like #42846 reduced the error rate by some amount but it's still flaky. We can probably further increase the base timeout and see if that helps, even if the test will become unbearably slow.

ShogunPanda commented 2 years ago

I have a further change in #42812 which should reduce it even more. I'd suggest to wait for that to land and see what happens.

Anyway given the success rate is above 99% I think we might think about closing this (since retries in CI are normal) and reopen only if there is consistent failure.

richardlau commented 2 years ago

I have a further change in #42812 which should reduce it even more. I'd suggest to wait for that to land and see what happens.

Anyway given the success rate is above 99% I think we might think about closing this (since retries in CI are normal) and reopen only if there is consistent failure.

This issue is tracking that the test is flaky, which it still appears to be so I'm reopening.

tniessen commented 2 years ago

It's interesting that it seems to only happen on macOS and FreeBSD. I might run a few stress tests.

tniessen commented 2 years ago

Patch (assuming the issue is timing-related):

diff --git a/test/parallel/test-http-server-headers-timeout-keepalive.js b/test/parallel/test-http-server-headers-timeout-keepalive.js
index 05531087ed83..ba82cb0e0c9c 100644
--- a/test/parallel/test-http-server-headers-timeout-keepalive.js
+++ b/test/parallel/test-http-server-headers-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
   }, firstDelay + secondDelay).unref();
 }

-const headersTimeout = common.platformTimeout(2000);
+const headersTimeout = common.platformTimeout(5000);
 const server = createServer({
   headersTimeout,
   requestTimeout: 0,
diff --git a/test/parallel/test-http-server-request-timeout-keepalive.js b/test/parallel/test-http-server-request-timeout-keepalive.js
index 2466e1ee7a95..eb6d64b2f2bb 100644
--- a/test/parallel/test-http-server-request-timeout-keepalive.js
+++ b/test/parallel/test-http-server-request-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
   }, firstDelay + secondDelay).unref();
 }

-const requestTimeout = common.platformTimeout(2000);
+const requestTimeout = common.platformTimeout(5000);
 const server = createServer({
   headersTimeout: 0,
   requestTimeout,

First run (master and with patch):

osx11 freebsd12-x64
master OK: 902 NOT OK: 98 TOTAL: 1000 OK: 863 NOT OK: 137 TOTAL: 1000
increased timeout OK: 676 NOT OK: 324 TOTAL: 1000 OK: 1000 NOT OK: 0 TOTAL: 1000

Second run (master and with patch):

osx11 freebsd12-x64
master OK: 910 NOT OK: 90 TOTAL: 1000 OK: 819 NOT OK: 181 TOTAL: 1000
increased timeout OK: 713 NOT OK: 287 TOTAL: 1000 OK: 1000 NOT OK: 0 TOTAL: 1000

Unless I'm mistaken, increasing the timeout fixes the issues on FreeBSD but somehow makes it worse on macOS?

I started a third run with fresh revisions because I find this hard to believe (master and with patch).

Update: Third run (master and with patch):

osx11 freebsd12-x64
master OK: 916 NOT OK: 84 TOTAL: 1000 OK: 838 NOT OK: 162 TOTAL: 1000
increased timeout OK: 721 NOT OK: 279 TOTAL: 1000 OK: 1000 NOT OK: 0 TOTAL: 1000
tniessen commented 2 years ago

Median success rate (refer to patch and separate tables above):

osx11 freebsd12-x64
master (2000ms) 91.0 % 83.8 %
patched (5000ms) 71.3 % 100 %
ShogunPanda commented 2 years ago

@richardlau Thanks, I think one commit closed this by mistake.

ShogunPanda commented 2 years ago

@tniessen This is very strange. I code on MacOS and never got a failure...

tniessen commented 2 years ago

@richardlau Based on https://github.com/nodejs/node/issues/42741#issuecomment-1110926314, would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS? Do GitHub actions respect our list of flaky tests?

richardlau commented 2 years ago

@richardlau Based on #42741 (comment), would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS?

I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.

Do GitHub actions respect our list of flaky tests?

They should do: https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/.github/workflows/test-macos.yml#L34 https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/Makefile#L526-L530

richardlau commented 2 years ago

I've no idea if it's the same underlying cause, but I've just spotted on LinuxONE: https://ci.nodejs.org/job/node-test-commit-linuxone/31851/nodes=rhel8-s390x/consoleFull

12:30:49 not ok 1273 parallel/test-http-server-request-timeout-keepalive
12:30:49   ---
12:30:49   duration_ms: 6.565
12:30:49   severity: fail
12:30:49   exitcode: 1
12:30:49   stack: |-
12:30:49     Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
12:30:49         at Proxy.mustCall (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:396:10)
12:30:49         at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-http-server-request-timeout-keepalive.js:77:27)
12:30:49         at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:438:15)
12:30:49         at Object.onceWrapper (node:events:641:28)
12:30:49         at Server.emit (node:events:527:28)
12:30:49         at emitListeningNT (node:net:1414:10)
12:30:49         at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
12:30:49   ...
ShogunPanda commented 2 years ago

I think this might have been an error from me, I have:

client.on('error', errOrEnd);
client.on('end', errOrEnd);

So being called two times is legit if things are not closed fast enough,

ShogunPanda commented 2 years ago

Should I mark this test and all tests in https://github.com/nodejs/node/pull/42893 (without raising the timeout)? They all rely on timers so I think we're gonna hard time guaranteeing they don't fail on slow platforms...

(just to clarify: they pass on most platforms so I don't think we're ignoring a bug)

tniessen commented 2 years ago

I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.

@richardlau I opened #42962 to mark the two known problematic tests as flaky on macOS. Increasing the timeout seems to fix the FreeBSD failures.

They all rely on timers so I think we're gonna hard time guaranteeing they don't fail on slow platforms...

The strange part is that the error rate seems to increase when the timeout increases (see https://github.com/nodejs/node/issues/42741#issuecomment-1110926314), so I am not sure if macOS being slow is the issue here.

ShogunPanda commented 2 years ago

@tniessen You are right. I actually spotted a bug while working on #42812, see changes in https://github.com/nodejs/node/pull/42812/files#diff-bbcf7a115fe4680f3a7c076a487c89bd4ce935e7d2a726cfcb3f69331705e25eR260.

~Do you mind trying again first with and 2 and then 5 seconds once that lands?~

Both related PR are now in master. Can you please try again with 2 and/or 5 seconds?

ShogunPanda commented 2 years ago

@tniessen Any update on this? Are tests improving?

joyeecheung commented 2 years ago

It seems the flake is still there: https://github.com/nodejs/reliability/issues/273

Reason parallel/test-http-server-request-timeouts-mixed
Type JS_TEST_FAILURE
Failed PR 13 (https://github.com/nodejs/node/pull/42960/, https://github.com/nodejs/node/pull/42963/, https://github.com/nodejs/node/pull/42623/, https://github.com/nodejs/node/pull/42968/, https://github.com/nodejs/node/pull/42867/, https://github.com/nodejs/node/pull/37769/, https://github.com/nodejs/node/pull/42423/, https://github.com/nodejs/node/pull/42601/, https://github.com/nodejs/node/pull/43005/, https://github.com/nodejs/node/pull/42970/, https://github.com/nodejs/node/pull/43036/, https://github.com/nodejs/node/pull/42725/, https://github.com/nodejs/node/pull/43023/)
Appeared test-orka-macos10.15-x64-1, test-macstadium-macos11.0-arm64-3, test-orka-macos11-x64-2, test-nearform-macos10.15-x64-1, test-orka-macos11-x64-1
First CI https://ci.nodejs.org/job/node-test-pull-request/43865/
Last CI https://ci.nodejs.org/job/node-test-pull-request/43960/
Example ``` not ok 1166 parallel/test-http-server-request-timeouts-mixed --- duration_ms: 3.255 severity: fail exitcode: 1 stack: |- node:assert:399 throw err; ^ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value: assert(request2.completed) at Timeout._onTimeout (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-http-server-request-timeouts-mixed.js:106:5) at listOnTimeout (node:internal/timers:564:17) at process.processTimers (node:internal/timers:507:7) { generatedMessage: true, code: 'ERR_ASSERTION', actual: false, expected: true, operator: '==' } Node.js v19.0.0-pre ... ```
LiviaMedeiros commented 2 years ago

I just comprehended what had happened here. The issue was mistakenly closed by cross-connection via pushing in another repository a commit having Fixes: reference. The push contained up to 567 latest commits (starting from 2022-04-01) from node repository, rebased on top of test commits.

Checking with curl https://api.github.com/users/LiviaMedeiros/events?per_page=100 | jq '.[]|select(.type=="IssuesEvent")' confirmed that only this issue was affected.

I deeply apologize for this embarassing incident. In the future, I'll keep anything like that on a separate GitHub account. Please let me know if there are other side effects or something to undo.

@aduh95 thanks for reopening this. Pinging you as a TSC member (if I made a mistake is applicable).

targos commented 1 year ago

I got a failure in https://github.com/nodejs/node/pull/44741#issuecomment-1260454905

aduh95 commented 3 weeks ago

Getting some failures again on osx11 ([test-http-server-request-timeouts-mixed](https://ci.nodejs.org/job/node-test-pull-request/59813/, https://ci.nodejs.org/job/node-test-pull-request/59820/)