Post Mortem Analysis with Async APIs

benjamingr commented 7 years ago

Currently, one workflow to debug NodeJS problems it to generate a core dump when an unhandled exception occurs.

The flow goes something like this:

Launch node with --abort-on-uncaught-exception
When an error is thrown globally (that is, not inside a try/catch or domain) - take a core dump on exit
Analyze the core dump for errors.

However, there are currently a lot of issues with the said flow and modern NodeJS that have not been addressed. I think post-mortem analysis based on core dumps is problematic with async APIs and I think we should solve that in order to ship a lot of things.

Operational errors and programmer errors are not distinguishable by being synchronous/asynchronous (examples in nodejs/promises in case anyone cares https://github.com/nodejs/promises/issues/10 etc). Our APIs default to not crash on errors if we do not check them explicitly (callbacks) or crash if there is no error handler (event emitters). JavaScript's concurrency primitives (promises, and soon observables and async iterators) default to not ignore errors (by raising an event) but not synchronously.
It is impossible to always abort and take a core dump on errors because our startup time is slow - see https://github.com/nodejs/post-mortem/issues/18 .
A lot of our APIs defer errors to nextTick, at which point stack traces are invalid (see https://github.com/nodejs/CTC/issues/12#issuecomment-295392382).

I think we're at a point where we're not giving users a very great experience for their tools (especially ones that are not experts).

Now, I'm not sure we can do much about the callback APIs, but before we have a real promised core we should address these concerns.

I'd also like to point out that this is specifically about state changes that happen within a microtask and that no I/O is processed in that time anyway today. So no I/O handling effects your core dump anyway.

Possible solutions for promises and core dumps:

Take a core dump when a rejection happens in your code, remove it when the rejection is handled.[1]
Abort synchronously with a flag if there is no catch handler and the promise was created more than a microtask ago.
Solve the problem only for async functions - since those must declare catch handlers synchronously and syntactically, and recommend that people who want to use promises use them.
Expose a flag that violates the ECMAScript standard and schedules promises synchronously.
Document that promises do not work with post-mortem debugging tools.

I'm going to ping @nodejs/post-mortem and current and previous people involved in promises work to discuss. I'd like to sort this once and for all and give users a better experience.

@misterdjules @mikeal @chrisdickinson @addaleax @Fishrock123 @petkaantonov @spion @thefourtheye @vkurchatkin @refack

[1]

@nodejs/post-mortem - How slow/fast is taking a core dump? Does it do copy on write? Is that possible?
We can do that only on throws, but since it is established that throw does not indicate a programmer error and a err callback does not indicate an operational one - I'm not sure if that makes sense.

addaleax commented 7 years ago

How slow/fast is taking a core dump?

The bottleneck is disk I/O, creating a core dump is basically the kernel copying the raw memory of a process to a file. (I would be really surprised if there were OSes that supported copy on write for coredumps, but I didn’t check that.)

jclulow commented 7 years ago

Launch node with --throw-on-uncaught-exception

Do you mean --abort-on-uncaught-exception?

benjamingr commented 7 years ago

@jclulow Nice catch 🙏! I started as copying it from the (then proposed) --throw-on-unhandled-rejection flag and forgot to change the throw part :)

jclulow commented 7 years ago

There essentially isn't a solution to this problem.

Promises by design swallow all of the synchronous exceptions that the interpreter emits to signal programmer errors, and it seems you cannot have a compliant Promises implementation that drops that unfortunate behaviour. If your Promises implementation isn't compliant, presumably it will break a bunch of software that depends on the way it is specified, which doesn't help users either.

I understand that advocates of this misguided model will cry foul and say that errors are not "swallowed", merely redirected into one asynchronous channel -- but that's actually at the heart of what's so abhorrent. By design, the system jams the synchronous exceptions that represent a logical inconsistency in the program, after which it cannot possibly be working as designed (e.g., ReferenceError, TypeError, etc), into the same channel as all of the operational errors (e.g., DNS lookup failures, connection time outs, etc).

At the end of the day, it just isn't a good idea. There doesn't appear to be a compromise that lets people use Promises as specified (and thus, presumably, as implemented in other venues like the browser), while retaining first-class post mortem debugging support for programmer errors. This is also true of indiscriminate wrapping of your entire program in a series of try/catch blocks, and I think the same solution applies: provide guidance that, if you want debuggable software, don't do that!

refack commented 7 years ago

At the end of the day, it just isn't a good idea. There doesn't appear to be a compromise that lets people use Promises as specified (and thus, presumably, as implemented in other venues like the browser), while retaining first-class post mortem debugging support for programmer errors. This is also true of indiscriminate wrapping of your entire program in a series of try/catch blocks, and I think the same situation applies: provide guidance that, if you want debuggable software, don't do that!

What about the suggested CLI flag? it might change the behaviour, but it might help fix a problem. Similar to Debug/Release builds.

jclulow commented 7 years ago

If you're going to change the behaviour, it may as well be a totally different interface. It will require testing your entire dependency tree in the compliant Promises environment as well as in the non-compliant environment -- and it's reasonably clear that isn't going to happen for the majority of software in the NPM registry.

vkurchatkin commented 7 years ago

I think abort-on-unhandled-rejection should solve everything and is totally doable. Anything else will probably make things even worse

refack commented 7 years ago

We are talking "post-mortem" something has already broken for the user. We should give them way to figure out what happened. Even if it's incomplete (as in they might find false positive in the way).

benjamingr commented 7 years ago

I understand that advocates of this misguided model will cry foul and say that errors are not "swallowed"

I'd rather we avoid that sort of tone.

By design, the system jams the synchronous exceptions that represent a logical inconsistency in the program, after which it cannot possibly be working as designed (e.g., ReferenceError, TypeError, etc), into the same channel as all of the operational errors (e.g., DNS lookup failures, connection time outs, etc).

This is incorrect, there are a lot of asynchronous programmer errors and a lot of synchronous operational errors. If you get passed incorrect input from an API and you JSON.parse it - it shouldn't "jam the system by design". I can dig up lots more examples if you'd like from the nodejs/promises repo whenwe discussed this.

(Another good example is listening on an invalid port - which throws synchronously but is clearly an operational error is the case 9 out of 10 times)

addaleax commented 7 years ago

If you're going to change the behaviour, it may as well be a totally different interface.

@jclulow That depends on how different the behaviour is. --abort-on-uncaught-exception is a behavioural change, too, but one that is small enough to not require testing of its own. Having something like that for Promises would be a good goal, even though I think it’s obvious that’s harder to achieve.

@refack This is specifically about post-mortem, do you mind removing off-topic comments?

refack commented 7 years ago

@addaleax edited.

refack commented 7 years ago

Wasn't that what node 0.x used to do? crash all the time. http://stackoverflow.com/questions/5999373/how-do-i-prevent-node-js-from-crashing-try-catch-doesnt-work. Horrible for deployment but useful for debugging.

MadaraUchiha commented 7 years ago

I'm going to repeat something that @mikeal said on the other thread.

Promises are here. Whether you want them or not, whether you use them in your code or not. A rapidly growing number of users, libraries, and open source projects are using Promises. You can't deny that.

Promises are trivial to implement on your own in under 200 LoC, so you can't really build an automatic tool that would check whether some code uses Promises or not. So you're left with manually code reviewing every 3rd party library you might want to use in your code.

Promises and async/await are a part of the language. It's the core dump that's the implementation detail, and I think that it isn't the job of V8 to yield to our needs, it's our job to yield to our users' needs.

jclulow commented 7 years ago

@jclulow That depends on how different the behaviour is. --abort-on-uncaught-exception is a behavioural change, too, but one that is small enough to not require testing of its own. Having something like that for Promises would be a good goal, even though I think it’s obvious that’s harder to achieve.

What was so serendipitous about --abort-on-uncaught-exception is that it only affects what we do when we were otherwise going to exit the program synchronously on an error condition. We actually do it with the entire stack still in tact, and the choice the user makes is really either to print that stack out to stderr (the default), or to abort and generate a core file instead.

By contrast, with promises, you basically can't ever know a priori whether a rejection will subsequently be handled -- at least not until you've already had to unwind the stack. To do otherwise would seem to be non-compliant with the specification, and will presumably break at least some software based on that specification. While I obviously think Promises are a bad idea, I also have extreme empathy for users of the things -- if you're going to use them, they should work the way they're documented and specified to work.

I also understand that they're a part of the language, and here to stay. So is the with block. This changes nothing about my firm belief that we should discourage people from using them where they also want to get sensible post mortem debugging support.

benjamingr commented 7 years ago

@addaleax

The bottleneck is disk I/O, creating a core dump is basically the kernel copying the raw memory of a process to a file. (I would be really surprised if there were OSes that supported copy on write for coredumps, but I didn’t check that.)

I saw there are at least some products that do this like https://code.google.com/archive/p/google-coredumper/

addaleax commented 7 years ago

@benjamingr Yeah, that looks like a straightforward userland implementation (for Linux/ELF), and Node could probably use that. It doesn’t do CoW, either, though (I would have no idea how that would be implementable in userland, and for the kernel it’s probably too much housekeeping overhead, too).

refack commented 7 years ago

Take a core dump when a rejection happens in your code, remove it when the rejection is handled.[1]

Meet-me-halfway solution

flag the VM keep n stacks just in case

sam-github commented 7 years ago

/cc @rnchamberlain, who was recently explaining how one way to get core dumps reasonably portably is to fork(2) and abort(3).

davepacheco commented 7 years ago

@sam-github Forking is dicey in multi-threaded programs, particularly if one component is not responsible for all of the threads. It's also dicey in a context where you know the program has encountered a non-operational error (i.e., a bug). You could cause the program to hang instead of either dumping core or proceeding on. Plus, important state about the program can change in the child process. This approach might be fine in some specific programs, but I wouldn't recommend it as a general approach.

refack commented 7 years ago

P.S. @benjamingr I fixed a typo in the 1st comment pint -> ping to

ljharb commented 7 years ago

side note: please don't name it "exception", as Promise.reject() isn't an exception - --abort-on-unhandled-rejection would be much clearer.

refack commented 7 years ago

@addaleax I misunderstood your request.

In my comment:

In windows we have a system call tostart a debugger on the living process, if one is registered so: na na na na na 😜

I meant Windows has a way to suspend a crashing process and trigger a "post-mortem" analysis while the process is still alive, it's called "JIT debugging". So there is less need for core dumps. This API can also be triggered programmatically like a reverse SIGUSR1 where the process is calling the debugger. So we could have

process.on('rejectionHandled uncaughtException unhandledRejection', () => {
   native-win32.callJITDebugger()
})

Which is a simple solution.

[editorial]Please take any comment of mine with an emoji as good-spirited, tongue-in-cheek, me trying to be funny or lighten up the conversation[/editorial]

davidmarkclements commented 7 years ago

For additional context around the challenges, there's a thread in the postmortem WG - https://github.com/nodejs/post-mortem/issues/16 - whilst a little dated it's summarises the problems with postmortems and promises well.

To go through the points

Take a core dump when a rejection happens in your code, remove it when the rejection is handled.

This could lead to a lot of overhead, also when do you decide a rejection is handled or not? handling can be async, so you have to wait a predefined amount of time, depending on how code is written this may var, (some kind of flag like --core-dump-on-definite-uncaught-rejection-timeout=100 summarises the setting). This approach is, to my eyes, quite convoluted. What effect will this have on the call stack?

Abort synchronously with a flag if there is no catch handler and the promise was created more than a microtask ago.

This ignores the ability to async attach a catch handler... that would have to be well documented, also, won't we lose the call stack (similar to aborting in a setImmediate)?

Solve the problem only for async functions - since those must declare catch handlers synchronously and syntactically, and recommend that people who want to use promises use them.

If we can do this in simple way without getting in the weeds with promises I'd be +1 on doing this as soon as possible and recommend async/await syntax in docs (also, from a syntactical/best practices position this is a general win). Not sure how this works with something like await Promise.all([p1, p2]) but that's a detail some way down the road. This path may reveal, help, inspire something for implementing core dumps for promises, or it may not. Either way, it gives us something

Expose a flag that violates the ECMAScript standard and schedules promises synchronously.

Very much against this, it could introduce breaking changes, it's can cause confusion, all my -1

Document that promises do not work with post-mortem debugging tools.

last resort (but a better option if combined with async/await core dumps)

spion commented 7 years ago

Here is the meat of the problem:

Promises mandate next-turn execution of any function passed to then
Due to this, promises must catch every single exception that runs in those functions, and at the very least delay the execution of any additional catch handlers by the next tick of the event loop.
By this time, the original call stack of the exception is gone unless preserved somehow. Additionally, the state of the heap may have been changed by other code that executed at the same time of the event loop. Due to both of the above, its not very useful to get a core dump

Some thoughts on the direction for the fixes, not sure if right. Looking for technical corrections if I got anything wrong:

We can't fix the problem by taking core dumps. Its too costly. Your heap is likely 1GB in size, and there will be a lot of uncaught exceptions flying around - in fact every single thrown error in a promise callback will mandate a dump because it must not be handled until the next tick. But perhaps the performance / availability of the copy on write approach can be investigated.
We can probably save the stack, if V8 provides a way to do it. Not sure about costs, but it will certainly be orders of magnitude less than a full core dump. The stack is definitely the more important part that needs to be preserved.
The next most important part are the references contained in the call stack that point to objects on the heap. Its unclear whether we can make use of them and/or how costly that might be. The GC moves things around all the time and/or collects them (e.g. new space -> old space). Maybe a V8 engineer could chime in on this?
Even if we do manage to track the references, the values they point to might change by the next tick. Perhaps it may be possible to also copy all the things that the stack is pointing to, but the costs are likely to grow very quickly. So it seems to me like this isn't really worth it - better to simply take the heap dump as it is, one tick later, call it "good enough", and deal with the fact that content may've been modified by things executing in the same tick.

Some additional ideas for performance:

the tooling could recognize where exceptions come from and be smart about not taking multiple snapshots.
alternatively tooling could wait for unhandled rejection events, and only then would start snapshotting all of the exceptions that look the same way (came from the same code). This adds difficulties for reproducibility (the event needs to happen twice) but they may be acceptable. However it can provide the memory contents at exception time, as it already gathered evidence that the same call stack of functions (as well as code lines) results with an uncaught exception.

mhdawson commented 7 years ago

As mentioned earlier in the thread I think promises are here so the task it figuring out the way forward given that reality. If enough of the people on this thread are going to be at the collaborator summit next week it might make sense to schedule a session to discuss in person.

benjamingr commented 7 years ago

@mhdawson I thought I could attend but work did not approve my days off nor did it budget time for it - so I'll be missing the coming summit :( I would love to remote-attend the said session or if we could do a hangouts I think it could be great.

mhdawson commented 7 years ago

@benjamingr if enough people are going to do that we'll try to add a session and get you connected in. If not I'm thinking we might want to schedule a hangout to get the interested parties together soon after.

spion commented 7 years ago

The more I think about it, the more this double-run idea sounds like a simple (if slightly inconvenient) solution to this whole problem:

At the first process run, collect call stack info about unhandled rejections. Record the functions, code files and line numbers of every unhandled rejection and store them for later use.
At the second run, track rejections even before they're unhandled, compare with previously recorded call stacks, and if there is a match, core-dump immediately. The memory at the time of the error should be correctly preserved.

bnoordhuis commented 7 years ago

We can probably save the stack, if V8 provides a way to do it.

Probably doesn't even need direct support from V8. The postmortem metadata that is part of the UNIX binaries is rich enough to decode JS stack frames and the objects in them.

mhdawson commented 7 years ago

Saving the stack would be a good first step and the double-run approach does sound like something we should take a look at. It does mean it might take longer to get the full debug info, but if is the only way then it much better than not being able to get it at all.

benjamingr commented 7 years ago

What's the status of async_hooks with promises?

addaleax commented 7 years ago

@benjamingr @matthewloring has a patch available that should work for making them compatible, so I think it’s mostly just a question of async_hooks getting ready. :)

mhdawson commented 7 years ago

FYI - work to investigate issues discussed here: https://github.com/nodejs/post-mortem/issues/45

benjamingr commented 7 years ago

Going to close this now since https://github.com/nodejs/post-mortem/issues/45 is happening. One last call for @misterdjules and anyone else from @nodejs/post-mortem to participate there if you'd like to help make promises work smoothly with post-mortem tools.

nodejs / CTC

Post Mortem Analysis with Async APIs #114