Closed mhdawson closed 7 years ago
@mhdawson Anything different about this new machine? Didn't it pass on the old ones?
Related to JOBS= ?
It did pass on the old one. The previous machine was a vm from siteox with different levels of the OS as opposed to the new ones that are from osusol. I have asked @gireeshpunathil to investigate.
The test case want to make sure every byte that is written to the standard stream is drained before the process exits, basically 1025 characters with 21 new lines.
The test case seem to have devised to validate an OS X behavior on the 1025th character. However in AIX case, at times it receives none while the python parent expects 21 lines.
The stdout of the child is piped into a tempfile, which is later read by the python parent.
I tried to isolate the issue in a number of manners without success: (though many of these different variant of one or two discrete scenarios)
So it looks like the python parent in the test framework has something peculiar or the machine I am running is different from the new CI one.
@mhdawson how easy it is to get hold of the failing CI system for debugging?
Created this request to get Gireesh access https://github.com/nodejs/build/issues/463
@gireeshpunathil None of those will work correctly. Your examples are all pipes.
You need to use a faked TTY. Either via pty.js or via the python test runner in pseudo-tty
, which uses a faked TTY. (Code mostly in tools/test.py
.)
To be clear: if AIX actually fails this then there is a big problem on AIX even if the test was designed for OS X.
Thanks @Fishrock123 - yes, I agree that all the above cases are variants of pipes.
I actually debugged the python code, and validated that the spawing code does not use faketty in this specific scenario, rather creates a temp file, its fd piped into the child node's streams, and then upon child termination, close the fd and read from the file - hence my attempt to mimic the behavior through case 6 above.
However, this is really useful information, I will study the faketty and debug further by taking lead from that. Thanks for the suggestion.
@gireeshpunathil it is possible that the python fake tty does not work on AIX -- in which case these tests should be disabled there.
@gireeshpunathil I just want to make sure you are investigating both failures
parallel/test-stdio-closed pseudo-tty/no_dropped_stdio
parallel/test-stdio-closed - fails in every run pseudo-tty/no_dropped_stdio - I've not seen fail in the community CI again, however, internally (IBM) we've reproduced with 316/1000 if you need help to reproduce follow up with @sxa555 or Jeremiah Akpotohwo in hursley
Have provided @gireeshpunathil with access to community machine
@gireeshpunathil have you had a change to look at this ?
@mhdawson - yes, but is a slow progress. Any major findings, I will post here.
just a status update: no major insights so far. Issue seen as consistent, trying to localize it into a python / python + C test case: to eliminate node's role in it.
looks like AIX has pseudo ttys, and it is enabled in the failing system: bash-4.3$ lsattr -l pty0 -E ATTnum 256 Maximum number of Pseudo-Terminals True BSDnum 16 Maximum number of BSD Pseudo-Terminals True autoconfig available STATE to be configured at boot time True csmap sbcs N/A True
What is not clear is how python in AIX is implemented with respect to pty. Will need to try out a test case to validate the sanity for pty.
The failure is manifested as a timeout, as the python parent does not receive anything from the node child for the timeout duration.
ps -ef output showed:
punathil 3015016 1835728 0 03:23:48 pts/2 0:00 python tools/test.py --mode=release pseudo-tty
punathil 3211902 3015016 64 0:00 <defunct>
This would mean that the child node has exited (probably after writing the data onto its stdout) but is shown as defunct because the python parent has not yet processed the SIGCHLD signal.
When I attach a debugger to the parent, this is what I see:
dbx) where
unnamed block in posix_close_(self = (nil), args = 0x3038a978), line 6651 in "posixmodule.c"
posix_close_(self = (nil), args = 0x3038a978), line 6651 in "posixmodule.c"
PyCFunction_Call(func = 0x300732a8, arg = 0x3038a978, kw = (nil)), line 81 in "methodobject.c"
unnamed block in call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
unnamed block in call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
unnamed block in PyEval_EvalFrameEx(f = 0x304235c8, throwflag = 0), line 2987 in "ceval.c"
IAR showed at 0xd2ed8eb8 which means we are in the close system call:
dbx) (0xd2ed8eb0)/10i
0xd2ed8eb0 ($b2691+0xc) 80610040 lwz r3,0x40(r1)
0xd2ed8eb4 ($b2691+0x10) 48026a55 bl 0xd2eff908 (close)
0xd2ed8eb8 ($b2691+0x14) 80410014 lwz r2,0x14(r1)
So the questions now are:
Investigating further.
Does this investigation apply to one or both of:
1) parallel/test-stdio-closed 2) pseudo-tty/no_dropped_stdio
?
This is only for 2) pseudo-tty/no_dropped_stdio
ok, came up with a small python code which works well in Linux but hangs in AIX. This is detached from node, which means that the problem is outside the node scope. While the code comes out immediately in Linux, the usage of fake_out for the stdout causes the python thread to hang in close() call.
import os
import subprocess
import pty
arg = ['ls', '-lrt']
(out_master, fd_out) = pty.openpty()
process = subprocess.Popen(
args = arg,
stdout = fd_out
)
os.close(fd_out)
I think the psuedo-tty tests should be omitted for AIX, while the python behavioral difference is understood and or addressed - as they don't reveal any issues in node instead blocks the test logic due to the python bug.
parallel/test-stdio-closed failure - test case performs the following:
In Linux, the child executes the stream writes and then exits with the specified error code (42). In AIX, the child exits on the first attempt to write on to the closed stream (error code:1)
And hence the failure: AssertionError: 1 == 42
Investigating further.
Should we be marking pseudo-tty/no_dropped_stdio
as flaky on AIX? Or is it already and I missed it?
Pulled out stdio failure from description in initial text for report since it is not being worked in this issue and does not seem related. https://github.com/nodejs/node/issues/8375
Just talking with @gireeshpunathil , I'll submit a PR to mark as flaky
PR to mark as flaky https://github.com/nodejs/node/pull/8385/files
Make sure to un-exclude this one as well once resolved: https://github.com/nodejs/node/issues/9765
Failures on new AIX machine: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/283/console