Closed: jlpoolen closed this issue 2 years ago.
Hmm, I see these log lines:
I20211226 19:53:37.461 s-GarageWest-sub moonfire_nvr::streamer] GarageWest-sub: waiting up to 0ns for TEARDOWN or expiration of 1 stale sessions
I20211226 20:01:01.574 s-GarageEast-sub moonfire_nvr::streamer] GarageEast-sub: waiting up to 0ns for TEARDOWN or expiration of 1 stale sessions
and I don't see a follow-up "done waiting" line for either of them. That's wrong. The 0ns is also suspect: it suggests the expiration time has already passed but the teardown code hasn't run yet. Maybe it got stuck in some way, like something holding the lock on the session group. There's a session group per camera (not per stream), so the main stream is also relevant. Last logged on those:
I20211226 19:14:36.280 s-GarageWest-main moonfire_nvr::streamer] GarageWest-main: waiting up to 63.999378818s for TEARDOWN or expiration of 1 stale sessions
I20211226 19:34:43.147 s-GarageEast-main moonfire_nvr::streamer] GarageEast-main: waiting up to 63.998807037s for TEARDOWN or expiration of 1 stale sessions
Those waits likewise never completed. Odd.
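A "waiting up to 0ns" line is what you'd expect if the remaining wait is computed as a saturating subtraction from the expiration instant: once the expiry is in the past, the remaining duration clamps to zero. A minimal sketch of how such a line can arise (this is an illustration, not Retina's actual code):

```rust
use std::time::{Duration, Instant};

fn main() {
    // Pretend the session's expiration instant is already 5 seconds in the past.
    let expiry = Instant::now() - Duration::from_secs(5);

    // saturating_duration_since clamps negative results to zero rather than
    // panicking, so a stale expiry produces a 0ns wait.
    let remaining = expiry.saturating_duration_since(Instant::now());
    println!("waiting up to {:?} for TEARDOWN", remaining); // prints "waiting up to 0ns for TEARDOWN"
}
```

So repeated 0ns waits mean the code keeps finding a session whose expiry has passed but which never actually gets removed from the group.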
All of these waits seem to be preceded by lines that suggest teardown_loop_forever already completed, e.g.:
D20211226 20:01:00.806 tokio-runtime-worker retina::client::teardown] TEARDOWN 43CFE9BB on existing conn succeeded (status 200).
But background_teardown seemingly didn't complete, even though its remaining code looks simple. Again, it's as if something is holding the lock. But I'm looking at all the places that take the lock, and they get in, do their thing without any waits, and get out.
Is this happening regularly? If I added some extra logging to figure this out, would it trigger in fairly short order?
Yes, this seems to happen within 8 hours, and it seems to affect the same cameras. It's been happening for several weeks now, I think; I've found that restarting the server takes care of the problem. I was wondering if the cameras might be adjusting their times enough to throw something out of kilter. I have all cameras using the same time server, e.g. "B".
Are the locks maintained in a table or on the file system? I could query the table over a read-only connection while moonfire-nvr is running. A patch I can apply, or a branched version, is fine if you want me to rebuild and run.
The answer came to me in the middle of the night. Writing it down before I forget it:
The locking is fine. It's a problem introduced in Retina f682d75 (between Retina v0.3.1 and v0.3.2, so at Moonfire NVR 095417bb) that only happens with old live555 servers. I switched how sessions get cleaned up. Before, every time the caller checked on stale sessions, it'd clean up anything expired. After, sessions launch a background task to clean themselves up. But I forgot about the "TCP sessions discovered via unexpected RTSP interleaved data packets" (case 2 in the SessionGroup API docs). They don't launch such a task, and so they never get cleaned up.
You were using UDP until fairly recently, right? It might be several days before I fix this because my kids are on break. You could try switching back to UDP in the meantime. I don't think this will happen if the live555 server has no TCP clients. (Note it can happen if this client is using UDP and another client is using TCP.)
btw, the lock I'm describing here is just a mutex in RAM.
I had switched back to TCP to help troubleshoot potential network issues. I will switch to UDP, and/or keep using TCP knowing of this potential problem and using my workaround of restarting the server. Thank you.
Please let me know if this fix works and send new debug logs if not.
Building now; will launch shortly with the new revisions. Will advise tomorrow, as that should be sufficient time for the problem to manifest itself.
Checked my Live View (TCP packets) and all cameras came up. A camera's failure to appear in Live View was the indicator of the problem; since all cameras appeared, I'm concluding the fix has worked. I also spot-checked random cameras for recordings and their sub streams, and all were populated with entries, including the most recent.
Thank you!
Confirming I'm still running the same process started 2 days ago when I last posted, and my refreshes of Live View continue to show all cameras, so this is further confirmation that the camera drop-out problem has been corrected. Thank you, again!
I have 5 Reolink cameras feeding into moonfire-nvr. 2 of the cameras (same model) stopped appearing in the LiveView. If I stop and start moonfire-nvr, they reappear and then after some time drop out. I've captured a log for an interval where the two cameras, GarageEast & GarageWest, dropped out.
The log runs for 24 hours from Sun 26 Dec 2021 09:57:43 AM PST to Mon 27 Dec 2021 09:58:08 AM PST. Here is a link to the log file that is being temporarily stored on Google Drive: https://drive.google.com/file/d/1yhoQtxqLSjDYEPxlDrvgz0_Z8F-xpmq2/view?usp=sharing
Tip: line 90,342 is at time point D20211226 20:00:02.449. Also, I restarted Monday at 8:38 a.m. (line 196,486), and both cameras appeared in the LiveView.
The drop-out occurred around 8:00 p.m. Sun. 12/26.
Raspberry Pi4:
Moonfire-nvr:
Here is the inventory shown through the web interface showing a drop-out of the main feed of camera GarageEast around 8:00 p.m.; there should be files from 8:00 p.m. until 11:59:
Here is the inventory showing a drop-out of the sub feed for camera GarageEast around 8:00 p.m.; there should be files from 8:00 p.m. until 11:59:
![firefox_2021-12-27_11-09-06](https://user-images.githubusercontent.com/227313/147501688-b2572051-f7a1-4664-a29f-5207d386ebb8.png)
Here's evidence that another camera's feed was saved under the same instance of moonfire:
![firefox_2021-12-27_11-44-17](https://user-images.githubusercontent.com/227313/147502758-5ddd0971-fcaf-41e0-95ae-4cc7abf6a761.png)
I could not find anything in the log that suggests a stoppage. Please let me know if there are some key words I should search for in the future. I do have other logs prior to this incident if they would be helpful.