Event emitter memory leak when a camera fails leads to no data being saved to disk

drsvpla commented 2 years ago

We have been seeing an issue on the v3.0.0 branch where, if one camera fails (i.e. due to actual hardware failure), NVR-JS very quickly stops writing any new metadata for the other still-functioning cameras, and after a few hours usually stops writing new video files for the still functioning cameras as well. The live streams still work, just that the actual persistent storage is no longer being written. Restarting NVRJS gets video files to work again for a short period but it always stops within a few hours.

When this happens, we get errors like this:

(node) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [Server]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
(node) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 listening listeners added to [Server]. Use emitter.setMaxListeners() to increase limit
(node) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 upgrade listeners added to [Server]. Use emitter.setMaxListeners() to increase limit

and

Error: ENOENT: no such file or directory, open '<path>/camera2/<timestamp>.mp4'
    at Object.openSync (fs.js:498:3)
    at Object.readFileSync (fs.js:394:35)
    at CreateMeta (<path>/nvr-js/NVRJS.js:447:24)
    at Socket.<anonymous> (<path>/nvr-js/NVRJS.js:627:4)
    at Socket.emit (events.js:376:20)
    at addChunk (internal/streams/readable.js:309:12)
    at readableAddChunk (internal/streams/readable.js:284:9)
    at Socket.Readable.push (internal/streams/readable.js:223:10)
    at Pipe.onStreamRead (internal/stream_base_commons.js:188:23) {
  errno: -2,
  syscall: 'open',
  code: 'ENOENT',
  path: <path>/NVRJS_SYSTEM/camera2/<timestamp>.mp4'
}

Tracing the emitter leak led to the const Socket = io(HTTP, IOptions); line in respawn at around line 605 in NVRJS.js.

I attempted to close the socket in the .on('close' handler a few lines later (to match the other cleanup), but that led the the entire web server being stopped, which obviously isn't ideal.

After a little digging, I came up with a work-around that seems to fix our issue (patch attached). This might not be the best fix, and I am not familiar enough with the codebase to be sure if this has any other unintended side effects. But we've been running it for about 18 hours on half a dozen systems with camera failures and they are properly writing metadata and video files now, and we're not seeing any issues on healthy systems either. I realize that's not a particularly long time to test for stability, but I figured it was worth writing up the issue quickly just in case there was a better fix, and I can report back on longer-term stability later.

Obviously the best fix is to repair or replace the broken cameras, but that isn't always something that can be done immediately, so continuing to write out data from the still working cameras while in that state is important.

marcus-j-davies commented 2 years ago

Hi @drsvpla Thanks for the diff - I'll attach it here for reference also.

v3 is still being worked on, so please let me know if your patch still holds up after some days, and I'll review any side affects that I will need to take care of.

thanks for the contribution. 👍

diff.txt

drsvpla commented 2 years ago

Sounds good. "Luckily" (not luckily) we have several systems with failed cameras, so that should hopefully increase the odds of one of them hitting any problems. But so far so good on all of them, which is a huge improvement. Hoping to do a fleet-wide deploy of this code soon to get more coverage on healthy systems too, right now this is just on the systems that need it + a test device with working cameras (though in theory, if all the cameras start up and stay working, healthy systems shouldn't see any changes with this code).

drsvpla commented 2 years ago

6 days in, and we're not seeing any obvious problems with this patch in place. Running smoothly on 6 systems with one failed camera each (and 2 working cameras each), and a couple systems with 3 healthy cameras as a control. All are still writing files (video and metadata) flawlessly for the cameras that work, web interface seems to be working fine, etc. The metrics we log all look in line with unpatched systems -- no obvious abnormal memory or CPU usage, no crashes, etc. Without this, systems with a failed camera would usually stop writing files within a few hours, so getting 6 days of consistent usable video out of them is definitely encouraging.

If you want any specific tests run or logs taken on systems with failed cameras, I can run them easily. But in terms of general stability and basic functionality, this seems to be working.

marcus-j-davies commented 2 years ago

Nice work @drsvpla I will pull these changes in to V3 (and that of your PR around the same time).

I'll leave this open until I have utilised this patch 👍

marcus-j-davies commented 2 years ago

@drsvpla

I have included this patch with your PR (merged) Thanks

marcus-j-davies / nvr-js

Event emitter memory leak when a camera fails leads to no data being saved to disk #11