I would start taking a look at https://github.com/pinojs/sonic-boom/blob/c0e57f2000e1dd13f9d497c8f00ed805b330fe68/index.js#L163. Are you logging multi-byte characters at all?
From what I can see there are no multi-byte characters being logged by this application. I still don't have a repro, but I think we could fix the issue ~~by adding a check on `actualWrite` to ensure we're not calling `fs.write[Sync]` when `sonic._writingBuf.length === 0`~~ by resetting `this._len` to 0 if both `this._writingBuf.length` and `this._bufs.length` are zero. Would that change be acceptable?
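For illustration, a minimal sketch of the reset being proposed. It is written against the internals named above; `sonic` is a hypothetical SonicBoom instance and the exact call site inside the library is an assumption:

```js
// Hypothetical helper mirroring the proposal above: if nothing is buffered
// anywhere, force the tracked length back to zero so we never schedule a
// pointless fs.write()/fs.writeSync() call for an empty buffer.
function resetLenIfEmpty (sonic) {
  if (sonic._writingBuf.length === 0 && sonic._bufs.length === 0) {
    sonic._len = 0
  }
}
```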
Do you (or possibly pino itself) add an `error` event handler to sonic-boom? That's the only condition where this can happen.
I'd be up for a reset fix if accompanied by a test (even with mocks).
The app still uses domains (I know, I know), so in practice that's the same as adding an `error` handler, right? That would explain why we're seeing it on this app and not on other ones (since the other ones don't use domains).
Yeah, I already have some ideas for a test. Knowing it could be `error`-related helps a lot.
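A rough sketch of the kind of mock-based test being discussed, assuming `tap` as the runner and the object-style `SonicBoom` constructor (older versions take the fd directly); the mock forces `fs.write` to report zero bytes written once to exercise the accounting:

```js
const { test } = require('tap')
const fs = require('fs')
const SonicBoom = require('sonic-boom')

test('does not get stuck when fs.write reports 0 bytes written', (t) => {
  const realWrite = fs.write
  let calls = 0
  // First call pretends no bytes made it out, later calls behave normally.
  fs.write = (fd, buf, ...args) => {
    const cb = args[args.length - 1]
    if (++calls === 1) return process.nextTick(cb, null, 0)
    return realWrite(fd, buf, ...args)
  }
  const stream = new SonicBoom({ fd: 1 }) // fd 1 = stdout
  stream.on('error', () => {})            // the error-handler condition mentioned above
  stream.write('hello world\n')
  stream.end()
  stream.on('finish', () => {
    fs.write = realWrite
    t.ok(calls < 10, 'write was not retried in a tight loop')
    t.end()
  })
})
```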
> The app still uses domains (I know, I know), so in practice that's the same as adding an `error` handler, right? That would explain why we're seeing it on this app and not on other ones (since the other ones don't use domains).
Yes, exactly.
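For context, a minimal sketch of why a domain ends up acting like an `error` handler for the stream: emitters created inside `d.run()` are implicitly bound to the active domain, so with no explicit `error` listener their `error` events are delivered to the domain instead of being thrown. The `SonicBoom` constructor arguments here are an assumption:

```js
const domain = require('domain')
const SonicBoom = require('sonic-boom')

const d = domain.create()
d.on('error', (err) => {
  // With no explicit 'error' listener on the stream, the domain receives the
  // event, which is effectively the same as having attached a handler.
  console.error('caught by domain:', err)
})

d.run(() => {
  // Created inside d.run(), so the emitter is bound to the domain.
  const sonic = new SonicBoom({ fd: 1 })
  sonic.write('hello\n')
  sonic.end()
})
```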
After more debugging and some excessive use of `console.log`s, I was able to track down the issue to https://github.com/pinojs/sonic-boom/blob/master/index.js#L353-L380. What's happening is that `writeSync` can't write the whole buffer in one call, so it returns a byte count that is less than `buf.length`, but SonicBoom doesn't check that and moves on to the next buffer in `_bufs`. That causes partial buffers to be written, and since we're subtracting the full length of these partially written buffers from `this._len`, we end up with a `_len` bigger than the amount of data actually left to be written. For pino users this can happen when using `log.fatal`, since that always calls `flushSync`.
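The actual fix went into the PR mentioned below; as a standalone illustration of the missing check, a synchronous write helper that honours the return value of `fs.writeSync` could look like this (a hypothetical helper, not sonic-boom's code):

```js
const fs = require('fs')

// Keep calling fs.writeSync until every byte has actually been written,
// instead of assuming one call drains the whole buffer.
function writeSyncAll (fd, str) {
  let buf = Buffer.from(str, 'utf8')
  while (buf.length > 0) {
    const n = fs.writeSync(fd, buf) // may report fewer bytes than buf.length
    buf = buf.subarray(n)           // only drop what was really written
  }
}

writeSyncAll(1, 'hello world\n')    // fd 1 = stdout
```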
I can send a pull request to fix that and add some tests, I just have a few questions:

1. If we're in the middle of async writing and we call `flushSync`, but we haven't finished writing what's in `_writingBuf`, we might end up in a situation where we write things out of order (since we'll synchronously write what's in `_buf` but we'll async write what's in `_writingBuf` afterwards). Is that a known issue? If not, is there a way we can prevent it?
2. Should we still add the check on `release` if this is fixed? IMO yes, to make the code more robust, but I'd like your take on it.
3. Should `flushSync` call `release` at the end (since `release` has a lot of "resetting state" and cleanup code)?

> If we're in the middle of async writing and we call `flushSync`, but we haven't finished writing what's in `_writingBuf`, we might end up in a situation where we write things out of order (since we'll synchronously write what's in `_buf` but we'll async write what's in `_writingBuf` afterwards). Is that a known issue? If not, is there a way we can prevent it?
There is no escaping from a write-out-of-order in this case. If `flushSync()` is called in that situation, it should be in a last-call scenario (the process is crashing), so some data logged is better than nothing.
> Should we still add the check on `release` if this is fixed? IMO yes, to make the code more robust, but I'd like your take on it.
I think so, but it really depends on whether this breaks anything else.
> Should `flushSync` call `release` at the end (since `release` has a lot of "resetting state" and cleanup code)?
Possibly yes, it might fix all things. However, it scares me a little because of possible side effects.
Thanks, I'll look into those items at a later point; for now I opened #164 just to fix the immediate bug.
This is a very weird bug I'm still tracking down. Some of our instances started having their CPUs pegged at 100%. When we looked at flame graphs, `sonic-boom` was using a huge chunk of CPU:

This code path is not CPU intensive, so I assumed it was being called way too frequently. After connecting a debugger to the process and looking at the variables when `release` is called, I got:

which explains why `release`/`actualWrite` are being called constantly: with `this._len` being above zero and the buffers being empty, we keep calling `fs.write`, which returns `n=0`, so we don't subtract anything from `this._len` and just keep calling it infinitely. I don't have a repro yet but I'm working on it. If anyone has any idea on how to force this scenario to happen, I'm happy to hear it (my guess is it happens in a specific scenario after an EAGAIN in async mode, but I'm not sure).
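To make the failure mode concrete, here is a stripped-down model of the `actualWrite`/`release` cycle in that state. It only mirrors the accounting described above; the names match the internals discussed but this is not sonic-boom's actual code, and a call cap is added so the demo terminates:

```js
// _len claims bytes are pending, but both buffers are empty, so every
// write reports 0 bytes written and _len never shrinks.
const state = { _len: 4096, _writingBuf: '', _bufs: [] }
let calls = 0

// Stand-in for fs.write: asynchronously reports how many bytes it "wrote".
function fakeWrite (buf, cb) {
  calls++
  setImmediate(cb, Buffer.byteLength(buf))
}

function actualWrite () {
  state._writingBuf = state._writingBuf || state._bufs.shift() || ''
  fakeWrite(state._writingBuf, release)
}

function release (n) {
  state._len -= n                      // n === 0 here, so this changes nothing
  state._writingBuf = ''
  if (state._len > 0 && calls < 10) {  // the cap exists only so the demo stops
    actualWrite()                      // the real code keeps spinning here
  } else {
    console.log('calls:', calls, '_len still:', state._len)
  }
}

actualWrite()
```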