Closed: jlpoolen closed this issue 2 years ago.
Hmm, I see these log lines:
I20211226 19:53:37.461 s-GarageWest-sub moonfire_nvr::streamer] GarageWest-sub: waiting up to 0ns for TEARDOWN or expiration of 1 stale sessions
I20211226 20:01:01.574 s-GarageEast-sub moonfire_nvr::streamer] GarageEast-sub: waiting up to 0ns for TEARDOWN or expiration of 1 stale sessions
and I don't see a follow-up "done waiting" line for either of them. That's wrong. The 0ns is also suspect: it suggests the expiration time has already passed but the teardown code hasn't run yet. Maybe it got stuck in some way, like something holding the lock on the session group. There's a session group per camera (not per stream), so the main stream is also relevant. Last logged on those:
I20211226 19:14:36.280 s-GarageWest-main moonfire_nvr::streamer] GarageWest-main: waiting up to 63.999378818s for TEARDOWN or expiration of 1 stale sessions
I20211226 19:34:43.147 s-GarageEast-main moonfire_nvr::streamer] GarageEast-main: waiting up to 63.998807037s for TEARDOWN or expiration of 1 stale sessions
Those waits likewise never completed. Odd.
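A "waiting up to 0ns" line is what you'd expect if the remaining wait is computed as a saturating subtraction from the expiration instant: once the expiry is in the past, the remaining duration clamps to zero. A minimal sketch of how such a line can arise (this is an illustration, not Retina's actual code):

```rust
use std::time::{Duration, Instant};

fn main() {
    // Pretend the session's expiration instant is already 5 seconds in the past.
    let expiry = Instant::now() - Duration::from_secs(5);

    // saturating_duration_since clamps negative results to zero rather than
    // panicking, so a stale expiry produces a 0ns wait.
    let remaining = expiry.saturating_duration_since(Instant::now());
    println!("waiting up to {:?} for TEARDOWN", remaining); // prints "waiting up to 0ns for TEARDOWN"
}
```

So repeated 0ns waits mean the code keeps finding a session whose expiry has passed but which never actually gets removed from the group.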
All of these waits seem to be preceded by lines that suggest teardown_loop_forever already completed, e.g.:
D20211226 20:01:00.806 tokio-runtime-worker retina::client::teardown] TEARDOWN 43CFE9BB on existing conn succeeded (status 200).
But background_teardown seemingly didn't complete, even though its remaining code looks simple. Again, it's as if something is holding the lock. But I'm looking at all the places that take the lock, and they get in, do their thing without any waits, and get out.
Is this happening regularly? If I added some extra logging to figure this out, would it trigger in fairly short order?
Yes, this seems to happen within 8 hours, and it seems to affect the same cameras. It's been happening for several weeks now, I think; I've found that restarting the server takes care of the problem. I was wondering if the cameras might be adjusting their times enough to throw something out of kilter. I have all cameras using the same time server, e.g. "B".
Are the locks maintained in a table or on the file system? I could query the table over a read-only connection while moonfire-nvr is running. A patch I can apply, or a branched version, is fine if you want me to rebuild and run.
The answer came to me in the middle of the night. Writing it down before I forget it:
The locking is fine. It's a problem introduced in Retina f682d75 (between Retina v0.3.1 and v0.3.2, so at Moonfire NVR 095417bb) that only happens with old live555 servers. I switched how sessions get cleaned up. Before, every time the caller checked on stale sessions, it'd clean up anything expired. After, sessions launch a background task to clean themselves up. But I forgot about the "TCP sessions discovered via unexpected RTSP interleaved data packets" (case 2 in the SessionGroup API docs). They don't launch such a task, and so they never get cleaned up.
You were using UDP until fairly recently, right? It might be several days before I fix this because my kids are on break. You could try switching back to UDP in the meantime. I don't think this will happen if the live555 server has no TCP clients. (Note it can happen if this client is using UDP and another client is using TCP.)
btw, the lock I'm describing here is just a mutex in RAM.
I had switched back to TCP to help troubleshoot potential network issues. I will switch to UDP, and/or keep using TCP knowing of this potential problem and using my workaround of restarting the server. Thank you.
Please let me know if this fix works and send new debug logs if not.
Building now; will launch shortly with the new revisions. Will advise tomorrow, as that should be sufficient time for the problem to manifest itself.
Checked my Live View (TCP packets) and all cameras came up. A camera's failure to appear in Live View was the indicator of the problem; since all cameras appeared, I'm concluding the fix has worked. I also spot-checked random cameras for recordings and their sub streams, and all were populated with entries, including the most recent.
Thank you!
Confirming I'm still running the same process started 2 days ago when I last posted, and my refreshes of Live View continue to show all cameras, so this is further confirmation that the camera drop-out problem has been corrected. Thank you, again!
I have 5 Reolink cameras feeding into moonfire-nvr. 2 of the cameras (same model) stopped appearing in the LiveView. If I stop and start moonfire-nvr, they reappear and then after some time drop out. I've captured a log for an interval where the two cameras, GarageEast & GarageWest, dropped out.
The log runs for 24 hours from Sun 26 Dec 2021 09:57:43 AM PST to Mon 27 Dec 2021 09:58:08 AM PST. Here is a link to the log file that is being temporarily stored on Google Drive: https://drive.google.com/file/d/1yhoQtxqLSjDYEPxlDrvgz0_Z8F-xpmq2/view?usp=sharing
Tip: line 90,342 is at time point D20211226 20:00:02.449. Also, I restarted Monday at 8:38 a.m. (line 196,486), and both cameras appeared in the LiveView.
The drop-out occurred around 8:00 p.m. Sun. 12/26.
Raspberry Pi4:
Moonfire-nvr:
Here is the inventory shown through the web interface showing a drop-out of the main feed of camera GarageEast around 8:00 p.m.; there should be files from 8:00 p.m. until 11:59:
Here is the inventory showing a drop-out of the sub feed for camera GarageEast around 8:00 p.m.; there should be files from 8:00 p.m. until 11:59:
![firefox_2021-12-27_11-09-06](https://user-images.githubusercontent.com/227313/147501688-b2572051-f7a1-4664-a29f-5207d386ebb8.png)
Here's evidence that another camera's feed was saved under the same instance of moonfire:
![firefox_2021-12-27_11-44-17](https://user-images.githubusercontent.com/227313/147502758-5ddd0971-fcaf-41e0-95ae-4cc7abf6a761.png)
I could not find anything in the log that suggests a stoppage. Please let me know if there are some key words I should search for in the future. I do have other logs prior to this incident if they would be helpful.