I would start taking a look at https://github.com/pinojs/sonic-boom/blob/c0e57f2000e1dd13f9d497c8f00ed805b330fe68/index.js#L163. Are you logging multi-byte characters at all?
From what I can see there are no multi-byte characters being logged by this application. I still don't have a repro, but I think we could fix the issue ~~by adding a check on `actualWrite` to ensure we're not calling `fs.write[Sync]` when `sonic._writingBuf.length === 0`~~ by resetting `this._len` to 0 if both `this._writingBuf.length` and `this._bufs.length` are zero. Would that change be acceptable?
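For illustration, a minimal sketch of the reset being proposed. It is written against the internals named above; `sonic` is a hypothetical SonicBoom instance and the exact call site inside the library is an assumption:

```js
// Hypothetical helper mirroring the proposal above: if nothing is buffered
// anywhere, force the tracked length back to zero so we never schedule a
// pointless fs.write()/fs.writeSync() call for an empty buffer.
function resetLenIfEmpty (sonic) {
  if (sonic._writingBuf.length === 0 && sonic._bufs.length === 0) {
    sonic._len = 0
  }
}
```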
Do you (or possibly pino itself) add an `error` event handler to sonic-boom? That's the only condition where this can happen.
I'd be up for a reset fix if accompanied by a test (even with mocks).
The app still uses domains (I know, I know), so in practice that's the same as adding an `error` handler, right? That would explain why we're seeing it on this app and not on other ones (since the other ones don't use domains).
Yeah, I already have some ideas for a test. Knowing it could be `error`-related helps a lot.
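A rough sketch of the kind of mock-based test being discussed, assuming `tap` as the runner and the object-style `SonicBoom` constructor (older versions take the fd directly); the mock forces `fs.write` to report zero bytes written once to exercise the accounting:

```js
const { test } = require('tap')
const fs = require('fs')
const SonicBoom = require('sonic-boom')

test('does not get stuck when fs.write reports 0 bytes written', (t) => {
  const realWrite = fs.write
  let calls = 0
  // First call pretends no bytes made it out, later calls behave normally.
  fs.write = (fd, buf, ...args) => {
    const cb = args[args.length - 1]
    if (++calls === 1) return process.nextTick(cb, null, 0)
    return realWrite(fd, buf, ...args)
  }
  const stream = new SonicBoom({ fd: 1 }) // fd 1 = stdout
  stream.on('error', () => {})            // the error-handler condition mentioned above
  stream.write('hello world\n')
  stream.end()
  stream.on('finish', () => {
    fs.write = realWrite
    t.ok(calls < 10, 'write was not retried in a tight loop')
    t.end()
  })
})
```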
> The app still uses domains (I know, I know), so in practice that's the same as adding an `error` handler, right? That would explain why we're seeing it on this app and not on other ones (since the other ones don't use domains).
Yes, exactly.
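For context, a minimal sketch of why a domain ends up acting like an `error` handler for the stream: emitters created inside `d.run()` are implicitly bound to the active domain, so with no explicit `error` listener their `error` events are delivered to the domain instead of being thrown. The `SonicBoom` constructor arguments here are an assumption:

```js
const domain = require('domain')
const SonicBoom = require('sonic-boom')

const d = domain.create()
d.on('error', (err) => {
  // With no explicit 'error' listener on the stream, the domain receives the
  // event, which is effectively the same as having attached a handler.
  console.error('caught by domain:', err)
})

d.run(() => {
  // Created inside d.run(), so the emitter is bound to the domain.
  const sonic = new SonicBoom({ fd: 1 })
  sonic.write('hello\n')
  sonic.end()
})
```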
After more debugging and some excessive use of `console.log`s, I was able to track down the issue to https://github.com/pinojs/sonic-boom/blob/master/index.js#L353-L380. What's happening is that `writeSync` can't write the whole buffer in one call, so it returns a byte count that is less than `buf.length`, but SonicBoom doesn't check that and moves on to the next buffer in `_bufs`. That causes partial buffers to be written, and since we're subtracting the full length of these partially written buffers from `this._len`, we end up with a `_len` bigger than the amount of data actually left to be written. For pino users this can happen when using `log.fatal`, since that always calls `flushSync`.
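The actual fix went into the PR mentioned below; as a standalone illustration of the missing check, a synchronous write helper that honours the return value of `fs.writeSync` could look like this (a hypothetical helper, not sonic-boom's code):

```js
const fs = require('fs')

// Keep calling fs.writeSync until every byte has actually been written,
// instead of assuming one call drains the whole buffer.
function writeSyncAll (fd, str) {
  let buf = Buffer.from(str, 'utf8')
  while (buf.length > 0) {
    const n = fs.writeSync(fd, buf) // may report fewer bytes than buf.length
    buf = buf.subarray(n)           // only drop what was really written
  }
}

writeSyncAll(1, 'hello world\n')    // fd 1 = stdout
```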
I can send a pull request to fix that and add some tests, I just have a few questions:

1. If we're in the middle of async writing and we call `flushSync`, but we haven't finished writing what's in `_writingBuf`, we might end up in a situation where we write things out of order (since we'll synchronously write what's in `_buf` but we'll async write what's in `_writingBuf` afterwards). Is that a known issue? If not, is there a way we can prevent it?
2. Should we still add the check on `release` if this is fixed? IMO yes, to make the code more robust, but I'd like your take on it.
3. Should `flushSync` call `release` at the end (since `release` has a lot of "resetting state" and cleanup code)?

> If we're in the middle of async writing and we call `flushSync`, but we haven't finished writing what's in `_writingBuf`, we might end up in a situation where we write things out of order (since we'll synchronously write what's in `_buf` but we'll async write what's in `_writingBuf` afterwards). Is that a known issue? If not, is there a way we can prevent it?
There is no escaping from a write-out-of-order in this case. If `flushSync()` is called in that situation, it should be in a last-call scenario (the process is crashing), so some data logged is better than nothing.
> Should we still add the check on `release` if this is fixed? IMO yes, to make the code more robust, but I'd like your take on it.
I think so, but it really depends on whether this breaks anything else.
> Should `flushSync` call `release` at the end (since `release` has a lot of "resetting state" and cleanup code)?
Possibly yes, it might fix all things. However, it scares me a little because of possible side effects.
Thanks, I'll look into those items at a later point; for now I opened #164 just to fix the immediate bug.
This is a very weird bug I'm still tracking down. Some of our instances started having their CPUs pegged at 100%. When we looked at flame graphs, `sonic-boom` was using a huge chunk of CPU:

This code path is not CPU intensive, so I assumed it was being called way too frequently. After connecting a debugger to the process and looking at the variables when `release` is called, I got:

which explains why `release`/`actualWrite` are being called constantly: with `this._len` being above zero and the buffers being empty, we keep calling `fs.write`, which returns `n=0`, so we don't subtract anything from `this._len` and just keep calling it infinitely. I don't have a repro yet but I'm working on it. If anyone has any idea on how to force this scenario to happen, I'm happy to hear it (my guess is it happens in a specific scenario after an EAGAIN in async mode, but I'm not sure).
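To make the failure mode concrete, here is a stripped-down model of the `actualWrite`/`release` cycle in that state. It only mirrors the accounting described above; the names match the internals discussed but this is not sonic-boom's actual code, and a call cap is added so the demo terminates:

```js
// _len claims bytes are pending, but both buffers are empty, so every
// write reports 0 bytes written and _len never shrinks.
const state = { _len: 4096, _writingBuf: '', _bufs: [] }
let calls = 0

// Stand-in for fs.write: asynchronously reports how many bytes it "wrote".
function fakeWrite (buf, cb) {
  calls++
  setImmediate(cb, Buffer.byteLength(buf))
}

function actualWrite () {
  state._writingBuf = state._writingBuf || state._bufs.shift() || ''
  fakeWrite(state._writingBuf, release)
}

function release (n) {
  state._len -= n                      // n === 0 here, so this changes nothing
  state._writingBuf = ''
  if (state._len > 0 && calls < 10) {  // the cap exists only so the demo stops
    actualWrite()                      // the real code keeps spinning here
  } else {
    console.log('calls:', calls, '_len still:', state._len)
  }
}

actualWrite()
```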