pinojs / sonic-boom

Extremely fast utf8 only stream implementation
MIT License

Infinite buffer growth when unable to flush in async mode #65

Closed mmarchini closed 3 years ago

mmarchini commented 3 years ago

Platform: Linux Version: 1.4.0 Node.js Version: v12.20.1

If sonic-boom is running in async mode and flush fails repeatedly with EAGAIN (for example, when writing to a named pipe whose reading process has stopped reading), the buffer will keep growing indefinitely. Because EAGAIN is not propagated to the user, the application has no way to detect that the buffer is not being flushed and is growing without bound. While being unable to flush for a long period of time is an indication that something elsewhere went terribly wrong, today there is no signal the application can act on.

I have two suggested features that would help users deal with this kind of situation:

  1. Propagate EAGAIN, either as its own event or as part of the 'error' event (this would need to be optional and disabled by default, otherwise it would be a breaking change)
  2. Provide a mechanism for users to drop data from the buffer (today it's possible to do so by making changes to _buf, but that's not a supported API and would involve many checks)

An alternative would be to provide a maxLength option and then let sonic-boom drop the oldest X bytes from the buffer. But emitting EAGAIN gives flexibility so that users can not only drop data, but also send alerts or perform other actions when the issue is detected.

If the two suggestions above are acceptable I'm willing to send a pull request for them.

mmarchini commented 3 years ago

We could also make flushSync emit this error in async mode so that the user can back off from attempting to write to an unwritable location (which causes any application in this situation to hang).

mcollina commented 3 years ago

Does the fd ever recover? Will it ever be possible to write to that destination?

I'm ok for your two main points anyway. As for the "error" behavior, I would emit an error for flush() or throw in case of flushSync() if the "wait" for EAGAIN goes over a maxEAGAIN option (or pick any name really).

(This is one more time I'm thankful this library is out of Node.js core.)

mmarchini commented 3 years ago

It could recover or not, depending on the issue. For example, if fd is a named pipe being read by a pino transport that sends data to a log aggregating service, and that service stops responding/returns only 500s for a while and the pino transport keeps retrying, fd will recover if/when the service recovers. In this case if the service doesn't recover, fd doesn't recover either. A situation where fd wouldn't recover is when the pino transport mistakenly stops reading from the named pipe (because of a coding mistake, for example).

As for the "error" behavior, I would emit an error for flush() or throw in case of flushSync() if the "wait" for EAGAIN goes over a maxEAGAIN option (or pick any name really).

I like the idea of setting a retry limit and emitting an error if it is exceeded. I'm not sure about flushSync() throwing, though: if the user still wants to recover, alert, or drop data when flushSync fails, that would need to be done in an uncaughtException handler, which seems less than ideal since that's a catch-all for errors in Node.js. It's worth noting that in some cases the code might not even be aware that flushSync is being called (the finalLogger instance comes to mind), so it's not as simple as adding try/catch wherever possible. Maybe for sonic-boom running with sync: false, flushSync could also emit an error instead of throwing? Since that would be a breaking change anyway, I'll start by sending a PR where flushSync throws if retries are exceeded.

As for naming the option, what about maxFlushRetries?

mcollina commented 3 years ago

That works for me. We are prepping pino@7, so it's a good time for a major bump.

That3Percent commented 3 years ago

We're running into a crash that we strongly suspect is due to this issue.

Is there any workaround available before this major release?

mcollina commented 3 years ago

We do not have a fix for this even in a minor. Would you like to send a PR @That3Percent?

That3Percent commented 3 years ago

> We do not have a fix for this even in a minor. Would you like to send a PR @That3Percent?

Probably not. (Sorry)

kibertoad commented 3 years ago

@mmarchini Would you still consider working on this?

mmarchini commented 3 years ago

Still on my list, but I haven't had the bandwidth to work on it yet.