nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.56k stars 29.05k forks source link

stdio buffered writes (chunked) issues & process.exit() truncation #6456

Closed eljefedelrodeodeljefe closed 3 years ago

eljefedelrodeodeljefe commented 8 years ago

If this is currently breaking your program, please use this temporary fix:

[process.stdout, process.stderr].forEach((s) => {
  s && s.isTTY && s._handle && s._handle.setBlocking &&
    s._handle.setBlocking(true)
})

As noted in #6297 async stdio will not be flushed upon immediate process.exit(). This may lay open general deficiencies around C exit() from C++ functions not being properly unwound and is probably not just introduced by latest libuv updates. It should be considered to add flushing, providing graceful exit and/or improving unwinding C++ stacks.

cc @jasnell, @kzc, @Qix-, @bnoordhuis

Issues

Discussion has been already taking place at several places, e.g. #6297, #6456, #6379

Summaries of Proposals

proposals are not exclusive and could lead to semantically unrelated contributions.

  • aid with process.stdout.flush()
  • process.setBlocking(true)
  • node --blocking-stdio
  • longjmp() towards main at exit in C++
  • move parts of process.exit() / process.reallyExit() to new method os.exit()
  • golang panic()- or c++ throw-like stack unwinding

    Discussions by Author (with content)


@ChALkeR I tried to discuss this some time ago at IRC, but postponed it for quite a long time. Also I started the discussion of this in #1741, but I would like to extract the more specific discussion to a separate issue.

I could miss some details, but will try to give a quick overview here.

Several issues here:

  1. Many calls to console.log (e.g. calling it in a loop) could chew up all the memory and die — #1741, #2970, #3171.
  2. console.log has different behavior while printing to a terminal and being redirected to a file. — https://github.com/nodejs/node/issues/1741#issuecomment-105333932.
  3. Output is sometimes truncated — #6297, there were other ones as far as I remember.
  4. The behaviour seems to differ across platforms.

As I understand it — the output has an implicit write buffer (as it's non-blocking) of unlimited size.

One approach to fixing this would be to:

  1. Introduce an explicit cyclic write buffer.
  2. Make writes to that cyclic buffer blocking.
  3. Make writes from the buffer to the actual output non blocking.
  4. When the cyclic buffer reaches it's maximum size (e.g. 10 MiB) — block further writes to the buffer until a corresponding part of it is freed.
  5. On (normal) exit, make sure the buffer is flushed.

For almost all cases, except for the ones that are currently broken, this would behave as a non-blocking buffer (because writes to the buffer are considerably faster than writes from the buffer to file/terminal).

For cases when the data is being piped to the output too quickly and when the output file/terminal does not manage to output it at the same rate — the write would turn into a blocking operation. It would also be blocking at the exit until all the data is written.

Another approach would be to monitor (and limit) the size of data that is contained in the implicit buffer coming from the async queue, and make the operations block when that limit is reached.

kzc commented 8 years ago

This is about having stdio finish chunked writes, mostly.

First the pending uv writes that node no longer has control over, then the chunked node writes.

This used to work

It never truly worked on UNIX, just the write thresholds were somewhat higher. As of node 6.x Mac buffers 1024 bytes of writes at the OS level and Linux buffers 64KB of writes. Non blocking writes will always succeed up until this threshold.

FWIW, the patch https://github.com/kzc/node/commit/92fc9e0d992f043a4b92d9d286514328f5df1b6d completely flushes stdio uv writes and node stream chunked writes at process.exit(). I think this behavior is expected by node users.

What we need is essentially a uv_graceful_stop()

I'd be in favor of that as long as no callbacks are invoked by libuv during such a blocking call so that no more work can be scheduled in user land. And only pending writes to stdio and files are flushed. The socket stream data can be in flight at the OS level anyway and no attempt should be made to flush them in my opinion.

Fishrock123 commented 8 years ago

It never truly worked on UNIX

Sometime prior to v1.0.0, it actually did, as far as I can tell. Again, conflating the original issue, with the amplification of it. :)

kzc commented 8 years ago

Prior to v1.0.0, it actually did

Before my time and a lot of other node developers.

Instead of looking at history I think we should just address the problems as they stand now.

Fishrock123 commented 8 years ago

Instead of looking at history I think we should just address the problems as they stand now.

It's the same problem.

saghul commented 8 years ago

Instead of looking at history I think we should just address the problems as they stand now.

Looking at history is important not to make the same mistake twice. There is tons of knowledge which is unfortunately not very accessible, in the form of commit logs, but one has to know what they're looking for.

Your approach could work, but the libuv bits will need polishing and Windows support.

kzc commented 8 years ago

Looking at history is important not to make the same mistake twice. There is tons of knowledge which is unfortunately not very accessible, in the form of commit logs, but one has to know what they're looking for.

Fair enough.

Your approach could work, but the libuv bits will need polishing and Windows support.

I'm the first to admit my libuv change was a hack. It's a working proof of concept. I was going to code up something to walk the event queue and flush pending write events when I discovered that uv__write already did exactly what I needed.

For the record, does Windows as of node 6.x exhibit this stdio not flushing upon process.exit() problem? The expected fail test case for this issue was not run on Windows suggesting to me that it might have actually worked. My patch moved that expected fail test to be a normal (passing) test.

Edit: the renamed test is here:

https://github.com/kzc/node/blob/92fc9e0d992f043a4b92d9d286514328f5df1b6d/test/parallel/test-stdout-buffer-flush-on-exit.js

bcoe commented 8 years ago

I think this is more of a special case @jasnell. If I call process.exit() it's my expectation that the process will exit within a tick; if there's an outbound HTTP request, as an example, I wouldn't want to wait for this to complete prior to killing the process.

However, flushing stdio is an exception:

Carving out an exception for stdio keeps the platform's behavior with what the community is accustomed to.

sam-github commented 8 years ago

The addition of process.exitCode made this much better, the correct way to exit a node app now that has pending I/O (it called console.log) is to set exit code, and to close all handles, timers, etc. I.e., to do a graceful exit. This can be a bit annoying, in that all your resources need to be tracked and closeable, but they should be, anyhow, if you want to exit gracefully!

This works well for us... except for node v0.10, which has no process.exitCode.... on v0.10 you need to emulate it by setting a global code, then doing the graceful resource cleanup... then in the on exit handler call process.exit(code)... which is pretty ugly. I wish we could backport process.exitCode to v0.10.

Part of the problem here is that process.exit() just seems to be a reasonable way to exit node... when it isn't really, its a "terminate with prejudice" directive. This is compounded by the fact that on Unix only (not Windows), console.log used to have a special exemption to make it blocking if and only if it was a terminal... not if it was piped. So people learnt to rely on this quirk.

sam-github commented 8 years ago

Oh, and CLI parsers that are calling console.log and then calling process.exit... thats just doing it wrong. They need to call log and throw an exception, and let the caller catch it and clean up gracefully before exit.

bcoe commented 8 years ago

@sam-github this is a pattern that has been used by almost every CLI application in the ecosystem since the beginning of time, to terminate execution before it applies to the libraries' consumer -- the alternative would be literally 10000+ consuming libraries changing their contract with commander, yargs, optimist, meow, tap, the list goes on.

isaacs commented 8 years ago

Saying "you should just do a graceful exit always" is not a valuable answer in this case, in the sense that it does not meaningfully move us towards a solution state. (Unless, of course, it's followed up with several dozen (hundred?) pull requests to refactor the many impacted programs in the suggested manner.)

The fact is that the contract changed, pretty dramatically. Maybe not the intended or documented contract, but the actual contract (as in, the way that node actually works) changed very dramatically here.

It's fine to point the finger at the userbase and say "Welp, they were doing it wrong, too bad", but the platform is mature and that kind of casual breakage is foolish. Many extant programs won't work on Node 6 because of this. Who's right or wrong hardly matters if users can't use your platform, and complaining about an un-boiled ocean doesn't solve the problem.

What's the sense in arguing when you're all alone?

eljefedelrodeodeljefe commented 8 years ago

I think both positions are arguable and can be combined in a fix in node. But I also think use of process.exit() was always and I cannot stress enough always wrong and should be slowly discouraged. By the way so was process.reallyExit so it's just historic...very historic. Looking at the repos in question fixes for gracefully exiting seem quite easy though.

bcoe commented 8 years ago

@eljefedelrodeodeljefe I'm comfortable with this, I'm fine with pushing people towards the new behavior with yargs@5.x, however there are 13,000 consuming libraries on prior versions and I don't want to break them.

isaacs commented 8 years ago

@eljefedelrodeodeljefe Up until relatively recently, there was literally no other way to set the exit code of a node process other than calling process.exit(), and it's been safe to assume that stdio streams would synchronously flush on exit for a very long time.

It's fine and good to complain about what's right or wrong. Ok, it was wrong to call process.exit, always and forever, we should feel shame for this. So we feel shame. Who cares? Node 6 broke a very significant number of node programs with this change in behavior.

So do we care about that breakage or not? It seems to me that turning our back on tens of thousands of broken programs is a choice that should not be taken lightly!

kzc commented 8 years ago

Updated process.exit fix: https://github.com/kzc/node/commit/a73ec2f007020aed2837800afc2169ace11d4c02 https://github.com/kzc/node/commit/29997921800e00a22d9f92d24704a0021be03bbf

eljefedelrodeodeljefe commented 8 years ago

@bcoe @isaacs sure, I care deeply fixing it - especially because the use is so widespread and clis being so crucial for the ecosystem. I didn't see a decent fix just yet :( My point is just that it should be a combined effort. If you see ways of mitigating this in future versions of the repos please go ahead immediately and node should fix everything backwards.

Funny enough I was suggesting using the EE instead of exit handlers as main control flow of a CLI some months ago but didn't see at the time that it's gonna explode that much.

Let's come up with good ideas over thee weekend! :)

kzc commented 8 years ago

@eljefedelrodeodeljefe Not sure what you're looking for in a fix that isn't some variation of the simple patch I proposed. stdio libuv queued writes and queued node chunked writes have to be flushed in a sync manner prior to exit(). That's all.

Fishrock123 commented 8 years ago

and it's been safe to assume that stdio streams would synchronously flush on exit for a very long time.

Node 6 broke a very significant number of node programs with this change in behavior.

Can someone point me to exactly how the behavior changed in v6? As far as I am aware, it is only easier to trigger, and that it is indeed the issue I mentioned previously. (I.e. writes become chunked at a smaller size) Is that true? (@saghul?)

Qix- commented 8 years ago

@kzc submit it as a PR here.

kzc commented 8 years ago

@Qix- I just put together the proof of concept to show this issue could be solved in a straightforward manner. I'm hoping some else would run with it and do the Windows piece - assuming it is required. Because the patch touches libuv it might be better for those folks to do that part. Presently uv_flush_sync, a.k.a. uv__write, does not return any error code. Inspecting such an error code would be wise before flushing the node-side stream chunks otherwise there could be a gap in the data. Or perhaps the libuv team would prefer to implement some variation of @jasnell's uv_graceful_stop() proposal. Either would be fine with me.

kzc commented 8 years ago

@saghul Is there a libuv function that provides the number of bytes pending to be written for a uv_stream_t*? That info would be sufficient to know whether the flushing of the node-side chunks can proceed - i.e., when libuv bytes pending to written for the stream is 0. If this function does not already exist it would be useful if uv__write() were to return that value.

saghul commented 8 years ago

Can someone point me to exactly how the behavior changed in v6? As far as I am aware, it is only easier to trigger,

That's my understanding. While we were doing async writes before 1.0, it's possible that the pty fix which landed in 1.9.0 also fixed it on OSX, thus making them really async.

I think we haven't understood 100% where the problem is. We do see its effect. So, it would be interesting to take Node 5 and see if stdout writes block or not on OSX. Then the same with Node 6 (which do not block).

saghul commented 8 years ago

Is there a libuv function that provides the number of bytes pending to be written for a uv_stream_t*?

@kzc see: http://docs.libuv.org/en/v1.x/stream.html#c.uv_stream_t.write_queue_size

Fishrock123 commented 8 years ago

it's possible that the pty fix which landed in 1.9.0 also fixed it on OSX, thus making them really async.

I don't think so, if you run https://github.com/nodejs/node/blob/master/test/known_issues/test-stdout-buffer-flush-on-exit.js on an OS X machine prior to that it also happens as expected.

I think we haven't understood 100% where the problem is. We do see its effect.

My understanding at the time when I investigated it with bnoordhuis was that flushing wasn't happening at the OS level, I think?

So, it would be interesting to take Node 5 and see if stdout writes block or not on OSX. Then the same with Node 6 (which do not block).

Hmmmm, I'm quite certain it did not block but I don't have proof of that, or at least not off-hand.

kzc commented 8 years ago

Revised process.exit fix with improved error checking: https://github.com/kzc/node/commit/29997921800e00a22d9f92d24704a0021be03bbf

kzc commented 8 years ago

@Fishrock123 The patch https://github.com/kzc/node/commit/29997921800e00a22d9f92d24704a0021be03bbf renames the test you mentioned to test/parallel/test-stdout-buffer-flush-on-exit.js and runs it successfully on Mac and Linux.

bcoe commented 8 years ago

for the public record, here's the workaround that I'm about to land for yargs (commander should be able to use a similar approach):

https://github.com/yargs/yargs/pull/501

Fishrock123 commented 8 years ago

I did some more digging for The yargs issue. Here's some old discussion of "Fix blocking / non-blocking stdio woes": https://github.com/nodejs/node-v0.x-archive/issues/3584

Currently (At the time of linked issue) process.stdin / stdout / stderr is blocking, except when it is a pipe on windows. Weird and surprising. Very unpractical in cases where stdio is used as an IPC mechanism between node processes.

Also, net.Socket#_handle.setBlocking() appears to have been added in https://github.com/nodejs/node/commit/20176a98416353d4596900793f739d5ebf4f0ee1 (v0.11.2)

bcoe commented 8 years ago

created a shim here for anyone else running into this issue.

kzc commented 8 years ago

@bcoe Be aware that calling setBlocking(true) is not a cure all. If a large write takes place before setBlocking(true) it does not work.

See: https://github.com/nodejs/node/issues/6456#issuecomment-218243381

That's why the process.exit() issue should be addressed in node itself and back ported.

saghul commented 8 years ago

Is test-stdout-buffer-flush-on-exit.js supposed to be reliable? It fails for me on Node 4, 5 and 6 on OSX and on Node 5 and 6 on Linux (haven't tested 4 there).

saghul commented 8 years ago

Is there a reliable test we can run with git bisect to try to understand where the problem originates? This one also fails for me with Node 4 on Linux. I'm really confused now. :-S

kzc commented 8 years ago

@saghul test-stdout-buffer-flush-on-exit.js is an expected-fail on node 4, 5 and 6 (without my patch).

Here's another program that fails to run as expected on node 4, 5 and 6 in a unix terminal:

// this program populates the libuv write queue upon first write over 64K
// then will populate the node stream chunk queue for subsequent writes.
for (var i = 1; i <= 1000; ++i) {
  process.stdout.write((i + 
    ': The quick brown fox jumps over the lazy dog.\n').repeat(1500));
}
process.exit(1);

It will run successfully on Mac and Linux with https://github.com/kzc/node/commit/29997921800e00a22d9f92d24704a0021be03bbf

saghul commented 8 years ago

@kzc I know your patch will make it work. What I want to know is what and where changed subtly between Node 4 and Node 6, since the core principle (writes being async) remained.

IOW, we need a test which passes in 4 (and maybe 5) but doesn't in 6.

kzc commented 8 years ago

The behavior only noticeably changed on Mac with node 6.0.0/libuv 1.9.0 at the tty: https://github.com/nodejs/node/issues/6456#issuecomment-215591289

Do not pipe or redirect the output as it changes the test and a different code path is taken in node. test-stdout-buffer-flush-on-exit.js is a pipe test, not a tty test.

It appears as if the fd is blocking at the tty in node 4 and 5 on Mac, but non-blocking on node 6 at the tty.

Linux behavior is the same in node 4, 5 and 6 - 64K is output upon process.exit() at tty before truncation.

kzc commented 8 years ago

But even when stdio is piped, node 4, 5 and 6 never completely flushed stdio upon process.exit(). I believe these stdio streams ought to be completely flushed regardless of being run at the tty or piped or redirected to a file.

Fishrock123 commented 8 years ago

@saghul known_issues tests fail if the bug exists. Contrary to what @kzc says, this bug exists since Node.js (io.js) v1.0.0. It does not appear to exist in v0.12 (.x) or before. The recent libuv 1.9.0 patch appears to have made it so that chunked writes are triggered at much smaller buffer(?) sizes, meaning the bug appears far more easily.

It appears as if the fd is blocking at the tty in node 4 and 5 on Mac, but non-blocking on 6 at the tty.

@kzc This isn't correct as far as my investigations have gone (see above..), but when chunked only the first chunk is flushed on (fast) exit.

kzc commented 8 years ago

@Fishrock123 I'm aware of what known_issues is for. I did not mention node v1.0.0, nor v0.12.

tty behavior is different than piped behavior.

My patch fixes flushing of stdio at the tty or when piped upon process.exit().

Fishrock123 commented 8 years ago

tty behavior is different than piped behavior.

Ah, right. Perhaps that has changed. I have not taken a look.

kzc commented 8 years ago

See also: https://github.com/nodejs/node/issues/6456#issuecomment-215692810

Fishrock123 commented 8 years ago

Hmmmmmm, looks like it's not in v1.0.0 after all but rather v1.0.2 (close enough).

(I forgot that I had to run iojs on older versions instead of node.)

Bisect of v1.0.0 and v2.0.0:

07bd05ba332e078c1ba76635921f5448a3e884cf is the first bad commit
commit 07bd05ba332e078c1ba76635921f5448a3e884cf
Author: Saúl Ibarra Corretgé <saghul@gmail.com>
Date:   Wed Jan 14 20:26:02 2015 +0100

    deps: update libuv to 1.2.1

    PR: https://github.com/iojs/io.js/pull/423
    Reviewed-by: Ben Noordhuis <info@bnoordhuis.nl>
    Reviewed-by: Bert Belder <bertbelder@gmail.com>

:040000 040000 f2ba9fd93434ec2457a5e53d5c047efedad70462 789f762d0ea0e69993460303430f5851bf1de1a0 M  deps
bisect run success

@saghul That's https://github.com/nodejs/node/commit/07bd05ba332e078c1ba76635921f5448a3e884cf

Fishrock123 commented 8 years ago

From the libuv 1.2.1 changelog: * unix: set non-block mode in uv_{pipe,tcp,udp}_open (Ben Noordhuis)

Maybe that's the culprit? (of the original issue?)

Fishrock123 commented 8 years ago

@kzc Can you open a PR with your code? We will be better able to move forward in that format.

saghul commented 8 years ago

From the libuv 1.2.1 changelog: * unix: set non-block mode in uv_{pipe,tcp,udp}_open (Ben Noordhuis)

Maybe that's the culprit? (of the original issue?)

Yes, that the original change. But we already knew that. The question is what made the thing worse between 4 and 6. The only change I can think of is: https://github.com/libuv/libuv/commit/387102b2475776e5e40a3f6da5041eb674ad4abf but I can't see how opening /dev/tty or /dev/ttys00X changes anything.

Fishrock123 commented 8 years ago

Yes, that the original change. But we already knew that.

(We did?)

kzc commented 8 years ago

@Fishrock123 I'd rather someone else put the PR together as there will likely be changes required for windows libuv support and additional error handling if flushSync is to be used in a non-process.exit() context that I don't have the time for. I just wanted to get the ball rolling with a working proof of concept and don't care about attribution for the patch.

Fishrock123 commented 8 years ago

I've adapted the proposed fix into https://github.com/nodejs/node/pull/6773 for further development.

isaacs commented 8 years ago

@Fishrock123

Can someone point me to exactly how the behavior changed in v6?

The change in the chunk size is exactly the behavior that changed, and the change in behavior is significant enough to be very relevant.

Here's a test script: https://gist.github.com/isaacs/1495b91ec66b21d30b10572d72ad2cdd

The number of test characters can be set by the first argument, and you can pass noexit as the second argument to make it not call process.exit().

When run with stdout and stderr both outputting to the terminal, the expected behavior is to print out a series of o to the terminal, then a capital O, carriage return, followed by a series of x, then capital X, carriage return, and then process exit. This works up to the point where it maxes out my terminal's buffer.

In versions of Node prior to 6.0, that expectation was always met, even when running node issue-6456.js 100000. In 6.0 and 6.1, the stdout and stderr are munged together, and if process.exit() is called, then the streams don't end, for any value above 1kb.

If the strings are not just x's and o's (for example, control characters returning the cursor to the start of the line in order to do a progress bar, like a certain CLI used by most node devs), then it gets even more wild, because the overlap can occur at unexpected places. Pass fancy as the third argument to the script in order to test this.

$ node issue-6456.js 100000 x fancy
ooooooooooooooooooooooooO
xxxxxxxxxxxxxxxxxxxxxxxxX

$ nave use 6.0

$ node issue-6456.js 100000 x fancy
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx            # <-- note, no CR here
$ node issue-6456.js 100000 noexit fancy
ooooooooooooooooooooooooOxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxX
kzc commented 8 years ago

Regardless of the change introduced in node 6.0.0, stdout and stderr was never fully flushed at process.exit() on all platforms in node 4 and 5. Node on Linux only flushes it to 64K at the tty, for example. I think we should fix this problem once and for all and fully flush stdio streams upon process.exit() to match the behaviour of a "regular" non-process.exit() program.

Fishrock123 commented 8 years ago

@isaacs Thanks for the info, could you clarify what you mean by this?

In 6.0 and 6.1, the stdout and stderr are munged together


then it gets even more wild, because the overlap can occur at unexpected places.

Oh my, interesting.


Hmmm, I had forgotten the original issue didn't effect TTYs and only pipes. Strange that there doesn't seem to be a directly related libuv change.