nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.24k stars 29.41k forks source link

AIX: pseudo-tty/no_dropped_stdio failures on new machine #7973

Closed mhdawson closed 7 years ago

mhdawson commented 8 years ago

Failures on new AIX machine: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/283/console


# 
# assert.js:89
#   throw new assert.AssertionError({
#   ^
# AssertionError: 1 == 42
#     at ChildProcess. (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-stdio-closed.js:25:10)
#     at ChildProcess. (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common.js:411:15)
#     at emitTwo (events.js:106:13)
#     at ChildProcess.emit (events.js:191:7)
#     at Process.ChildProcess._handle.onexit (internal/child_process.js:204:12)
  ---

  ...
length differs.
expect=21
actual=0
patterns:
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^oooooooooooooooooooooooooooooooooooooooooooooooo$
pattern = ^ooooooooooooooooooooooooO$
outlines:
not ok 1170 pseudo-tty/no_dropped_stdio
  ---
  duration_ms: 0.186
Fishrock123 commented 8 years ago

@mhdawson Anything different about this new machine? Didn't it pass on the old ones?

jbergstroem commented 8 years ago

Related to JOBS= ?

mhdawson commented 8 years ago

It did pass on the old one. The previous machine was a vm from siteox with different levels of the OS as opposed to the new ones that are from osusol. I have asked @gireeshpunathil to investigate.

gireeshpunathil commented 8 years ago

The test case want to make sure every byte that is written to the standard stream is drained before the process exits, basically 1025 characters with 21 new lines.

The test case seem to have devised to validate an OS X behavior on the 1025th character. However in AIX case, at times it receives none while the python parent expects 21 lines.

The stdout of the child is piped into a tempfile, which is later read by the python parent.

I tried to isolate the issue in a number of manners without success: (though many of these different variant of one or two discrete scenarios)

  1. node no_dropped_stdio.js | xargs echo
  2. node no_dropped_stdio.js | wc -c
  3. node no_dropped_stdio.js > out.txt 2>&1
  4. A ndoe parent with piped streams for the child
  5. A node parent with inherited streams for the child
  6. A node parent with file fd passed for child streams and later read from the file

So it looks like the python parent in the test framework has something peculiar or the machine I am running is different from the new CI one.

@mhdawson how easy it is to get hold of the failing CI system for debugging?

mhdawson commented 8 years ago

Created this request to get Gireesh access https://github.com/nodejs/build/issues/463

Fishrock123 commented 8 years ago

@gireeshpunathil None of those will work correctly. Your examples are all pipes.

You need to use a faked TTY. Either via pty.js or via the python test runner in pseudo-tty, which uses a faked TTY. (Code mostly in tools/test.py.)

To be clear: if AIX actually fails this then there is a big problem on AIX even if the test was designed for OS X.

gireeshpunathil commented 8 years ago

Thanks @Fishrock123 - yes, I agree that all the above cases are variants of pipes.

I actually debugged the python code, and validated that the spawing code does not use faketty in this specific scenario, rather creates a temp file, its fd piped into the child node's streams, and then upon child termination, close the fd and read from the file - hence my attempt to mimic the behavior through case 6 above.

However, this is really useful information, I will study the faketty and debug further by taking lead from that. Thanks for the suggestion.

Fishrock123 commented 8 years ago

@gireeshpunathil it is possible that the python fake tty does not work on AIX -- in which case these tests should be disabled there.

mhdawson commented 8 years ago

@gireeshpunathil I just want to make sure you are investigating both failures

parallel/test-stdio-closed pseudo-tty/no_dropped_stdio

parallel/test-stdio-closed - fails in every run pseudo-tty/no_dropped_stdio - I've not seen fail in the community CI again, however, internally (IBM) we've reproduced with 316/1000 if you need help to reproduce follow up with @sxa555 or Jeremiah Akpotohwo in hursley

mhdawson commented 8 years ago

Have provided @gireeshpunathil with access to community machine

mhdawson commented 8 years ago

@gireeshpunathil have you had a change to look at this ?

gireeshpunathil commented 8 years ago

@mhdawson - yes, but is a slow progress. Any major findings, I will post here.

gireeshpunathil commented 8 years ago

just a status update: no major insights so far. Issue seen as consistent, trying to localize it into a python / python + C test case: to eliminate node's role in it.

gireeshpunathil commented 8 years ago

looks like AIX has pseudo ttys, and it is enabled in the failing system: bash-4.3$ lsattr -l pty0 -E ATTnum 256 Maximum number of Pseudo-Terminals True BSDnum 16 Maximum number of BSD Pseudo-Terminals True autoconfig available STATE to be configured at boot time True csmap sbcs N/A True

What is not clear is how python in AIX is implemented with respect to pty. Will need to try out a test case to validate the sanity for pty.

gireeshpunathil commented 8 years ago

The failure is manifested as a timeout, as the python parent does not receive anything from the node child for the timeout duration.

ps -ef output showed:

punathil 3015016 1835728   0 03:23:48  pts/2  0:00 python tools/test.py --mode=release pseudo-tty
punathil 3211902 3015016  64                  0:00 <defunct>

This would mean that the child node has exited (probably after writing the data onto its stdout) but is shown as defunct because the python parent has not yet processed the SIGCHLD signal.

When I attach a debugger to the parent, this is what I see:

dbx) where
unnamed block in posix_close_(self = (nil), args = 0x3038a978), line 6651 in "posixmodule.c"
posix_close_(self = (nil), args = 0x3038a978), line 6651 in "posixmodule.c"
PyCFunction_Call(func = 0x300732a8, arg = 0x3038a978, kw = (nil)), line 81 in "methodobject.c"
unnamed block in call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
unnamed block in call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
call_function(pp_stack = 0x2ff1a21c, oparg = 1), line 4350 in "ceval.c"
unnamed block in PyEval_EvalFrameEx(f = 0x304235c8, throwflag = 0), line 2987 in "ceval.c"

IAR showed at 0xd2ed8eb8 which means we are in the close system call:

dbx) (0xd2ed8eb0)/10i
0xd2ed8eb0 ($b2691+0xc)  80610040         lwz   r3,0x40(r1)
0xd2ed8eb4 ($b2691+0x10) 48026a55          bl   0xd2eff908 (close)    
0xd2ed8eb8 ($b2691+0x14) 80410014         lwz   r2,0x14(r1)

So the questions now are:

  1. Why the python parent thread is not closing?
  2. What happened to the child data written to its stdout which were faked by python into its memory?

Investigating further.

mhdawson commented 8 years ago

Does this investigation apply to one or both of:

1) parallel/test-stdio-closed 2) pseudo-tty/no_dropped_stdio

?

gireeshpunathil commented 8 years ago

This is only for 2) pseudo-tty/no_dropped_stdio

gireeshpunathil commented 8 years ago

ok, came up with a small python code which works well in Linux but hangs in AIX. This is detached from node, which means that the problem is outside the node scope. While the code comes out immediately in Linux, the usage of fake_out for the stdout causes the python thread to hang in close() call.

import os
import subprocess
import pty

arg = ['ls', '-lrt']
(out_master, fd_out) = pty.openpty()
process = subprocess.Popen(
  args = arg,
  stdout = fd_out
)

os.close(fd_out)

I think the psuedo-tty tests should be omitted for AIX, while the python behavioral difference is understood and or addressed - as they don't reveal any issues in node instead blocks the test logic due to the python bug.

gireeshpunathil commented 8 years ago

parallel/test-stdio-closed failure - test case performs the following:

  1. spawns a child (shell) with inhertied streams from parent.
  2. the shell executes a child node, whose streams are inherently closed.
  3. Child writes to stdout, and then onto stderr, and then exits with a specific error code.
  4. Parent asserts on the error code.

In Linux, the child executes the stream writes and then exits with the specified error code (42). In AIX, the child exits on the first attempt to write on to the closed stream (error code:1)

And hence the failure: AssertionError: 1 == 42

Investigating further.

Fishrock123 commented 8 years ago

Should we be marking pseudo-tty/no_dropped_stdio as flaky on AIX? Or is it already and I missed it?

mhdawson commented 8 years ago

Pulled out stdio failure from description in initial text for report since it is not being worked in this issue and does not seem related. https://github.com/nodejs/node/issues/8375

mhdawson commented 8 years ago

Just talking with @gireeshpunathil , I'll submit a PR to mark as flaky

mhdawson commented 8 years ago

PR to mark as flaky https://github.com/nodejs/node/pull/8385/files

mhdawson commented 7 years ago

Make sure to un-exclude this one as well once resolved: https://github.com/nodejs/node/issues/9765