scottlamb / moonfire-nvr

Moonfire NVR, a security camera network video recorder
Other
1.19k stars 138 forks source link

Camera stops recording after a few days #209

Closed robin-thoni closed 2 years ago

robin-thoni commented 2 years ago

Describe the bug One of my camera (the same as in #197) stops recording after a few days with the following logs:

moonfire_1  | I20220328 20:09:03.324 sync-/var/lib/moonfire-nvr/media/main moonfire_db::db] Flush 10362 (why: 120 sec after start of 59 seconds cam01-rt02-main recording 1/30498):
moonfire_1  | /var/lib/moonfire-nvr/media/main: added 5M 859K 280B in 2 recordings (1/30498, 1/30499), deleted 0B in 0 (), GCed 0 recordings ().
moonfire_1  | /var/lib/moonfire-nvr/media/sub: added 2M 812K 120B in 1 recordings (2/30498), deleted 4M 215K 99B in 2 (2/30481, 2/30482), GCed 2 recordings (2/30479, 2/30480).
moonfire_1  | W20220328 20:09:38.034 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: sleeping for PT1S after error: stream ended
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | W20220328 20:09:38.039 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: sleeping for PT1S after error: stream ended
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | I20220328 20:09:39.034 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:39.039 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | W20220328 20:09:39.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: sleeping for PT1S after error: [172.16.12.2:33708(me)->10.15.9.1:554@2022-03-28T20:09:39, 1302@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=0000b8c6] Wrong ssrc; expecting ssrc=Some(76bbcb4e) seq=Some(b8c6)
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | W20220328 20:09:39.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: sleeping for PT1S after error: [172.16.12.2:33706(me)->10.15.9.1:554@2022-03-28T20:09:39, 39474@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=00006747] Out-of-order packet or large loss; expecting ssrc=Some(11339322) seq=Some(b8e4)
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | I20220328 20:09:40.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:40.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: waiting up to 63.999408706s for TEARDOWN or expiration of 1 stale sessions
moonfire_1  | I20220328 20:09:40.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:40.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: waiting up to 63.991108989s for TEARDOWN or expiration of 1 stale sessions
moonfire_1  | I20220328 20:10:23.392 sync-/var/lib/moonfire-nvr/media/sub moonfire_db::db] Flush 10363 (why: 120 sec after start of 59 seconds cam01-rt02-sub recording 2/30499):
moonfire_1  | /var/lib/moonfire-nvr/media/main: added 1M 698K 287B in 1 recordings (1/30500), deleted 0B in 0 (), GCed 0 recordings ().
moonfire_1  | /var/lib/moonfire-nvr/media/sub: added 3M 602K 854B in 2 recordings (2/30499, 2/30500), deleted 2M 337K 240B in 1 (2/30483), GCed 2 recordings (2/30481, 2/30482).

Expected behavior The camera should record indefinitely, or restart if the stream dies for some reason.

Screenshots Screenshot from 2022-03-30 10-31-38 Screenshot from 2022-03-30 10-31-25 Screenshot from 2022-03-30 10-33-59

Server (please complete the following information):

$ docker ps | grep moonfire
ddfdaeb31c49        scottlamb/moonfire-nvr:v0.7.1                     "/usr/local/bin/moon…"    2 days ago          Up 35 minutes           127.0.0.1:35280->8080/tcp                                                                             docker-moonfire-server_moonfire_1

$ docker images | grep moonfire
scottlamb/moonfire-nvr                          v0.7.1                  d124dc9d87cc        3 months ago        755MB

Camera (please complete the following information): Model: Trendnet TV-IP311PI Firmware Version: V5.0.1 140303

scottlamb commented 2 years ago

Are you using TCP or UDP to talk with the cameras? (The default is TCP, so it's TCP unless you picked UDP within the camera setup dialog.)

sleeping for PT1S after error: stream ended

This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.

sleeping for PT1S after error: [172.16.12.2:33708(me)->10.15.9.1:554@2022-03-28T20:09:39, 1302@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=0000b8c6] Wrong ssrc; expecting ssrc=Some(76bbcb4e) seq=Some(b8c6)

I might add some more debug logging to narrow this down. Not sure how familiar you are with RTSP, but it's layered on top of another protocol, RTP. RTP packets all have a ssrc that more or less identifies the stream. Moonfire was expecting one ssrc (either because it was promised in a RTSP header or because a previous packet used that ssrc; extra debug logging could tell us which) and got a packet for a different one.

When this happens, Moonfire suspects it's due to a bug in old versions of a popular RTSP server implementation called live555. In short, the server doesn't tear down TCP sessions properly, and the camera can end up sending several of them over the same connection, which cascades into problems with overload and then another bug that causes the connection to become corrupt after the TCP window camera's kernel TCP send buffer fills. There's a whole saga about this at https://github.com/scottlamb/retina/issues/17 .

Moonfire tries to avoid this cascade by waiting for the old session to go away before trying again, which is why you see this message:

waiting up to 63.999408706s for TEARDOWN or expiration of 1 stale sessions

What happens after that? There should be another attempt around 20:10:44.

robin-thoni commented 2 years ago

I'm using TCP explicitly in the config

What happens after that? There should be another attempt around 20:10:44.

Nothing, that's the end of the logs

This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.

If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?

scottlamb commented 2 years ago

Nothing, that's the end of the logs

😞

It kept running but didn't log any more? Was this the only camera Moonfire was configured to record?

If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?

No, that part is definitely broken then.

This sounds similar to #184, which I'd thought I'd fixed. I'll take another look.

scottlamb commented 2 years ago

Reread through the code in question, nothing's jumping out at me yet.

If you don't mind, it might help to enable more logging. Where there are --env lines in your nvr wrapper script, you can add one like --env=MOONFIRE_LOG=info,moonfire_nvr=debug,retina=debug, then do a nvr stop; nvr rm; nvr run to make it take effect. When a connection is dropped, there should then be a sequent of debug log lines from retina::client::teardown that narrow down the problem.

robin-thoni commented 2 years ago

Yeah, it kept running and recording the second camera. Just no recordings/logs for that one anymore.

I just enabled the logs, and I'm now getting logs like this:

moonfire_1  | D20220401 10:23:19.197 tokio-runtime-worker retina::client] SETUP response: Response {
moonfire_1  |     version: V1_0,
moonfire_1  |     status: Ok,
moonfire_1  |     reason_phrase: "OK",
moonfire_1  |     headers: Headers(
moonfire_1  |         {
moonfire_1  |             HeaderName(
moonfire_1  |                 "CSeq",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "3",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Date",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "Fri, Apr 01 2022 10:23:19 GMT",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Session",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "1353878211;timeout=60",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Transport",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=32887ea3;mode=\"play\"",
moonfire_1  |             ),
moonfire_1  |         },
moonfire_1  |     ),
moonfire_1  |     body: b"",
moonfire_1  | }

Is that what you're looking for? If so, I'll send you everything once it happens again

robin-thoni commented 2 years ago

Ah, second problem: the second camera has actually stopped recording since I restarted Moonfire on March 19th:

Screenshot from 2022-04-01 10-28-53

Screenshot from 2022-04-01 10-32-15

As you can see, there's a 0/1 second recording, then nothing. I've restarted multiple time just right now, but the recording won't restart. The camera is working/streaming fine outside of Moonfire.

I attached the logs since I restarted a few minutes ago.

Probably a different bug, though?

scottlamb commented 2 years ago

Ahh! Got it. Those logs helped, and I believe both cameras' problems are due to the same bug.

From this log line:

W20220401 10:23:19.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: sleeping for PT1S after error: [172.16.12.2:54278(me)->10.15.9.2:554@2022-04-01T10:23:19, 1302@2022-04-01T10:23:19 ch=0, stream=0, ssrc=32887ea3, seq=000090b1] Wrong ssrc; expecting ssrc=Some(4c29bf0c) seq=Some(90b1)

I see that it reached this code here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/rtp.rs#L135

in which it sees a strange ssrc and assumes it's due to a stale live555 tcp bug session, as described here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L70-L74

And we see that Moonfire is waiting for this session to go away:

I20220401 10:23:20.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: waiting up to 63.999485425s for TEARDOWN or expiration of 1 stale sessions

but it never does [edit: removed confused ideas based on looking at the wrong version of the code] because you're running Moonfire v0.7.1, which predates my fix to #184.

Also, your camera isn't advertising itself as using a buggy version of live555 (even though it is using such a version, or some other software with a similar bug), or we'd have seen this line logged:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L1320-L1322

Anyway, I think there are several changes to make to improve this situation:

In the meantime, a couple ways you can work around this problem:

robin-thoni commented 2 years ago

Ah, glad it helped!

Thanks for taking for the time to chase that bug! Unfortunately, I have little to no knowledge to help you to fix it...

By this time, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to Moonfire API. That allows me to trigger a problem when a camera has not any recording for 5 minutes.

jlpoolen commented 2 years ago

Hi Robin, "Poc"?? and could you expound upon your implementation of an alert system? Perhaps through the mailing list for this project? Thank you, John

robin-thoni commented 2 years ago

Hi John, I meant Proof of Concept. That's a very simple simple gateway that will authenticate to Moonfire and do the following:

The second part is a Zabbix template to connect Zabbix to it. That's the end result:

Screenshot from 2022-04-01 18-59-45

Screenshot from 2022-04-01 19-00-59

I plan to release it shortly

scottlamb commented 2 years ago

Unfortunately, I have little to no knowledge to help you to fix it...

No worries. If you'd like to learn more (about Rust, RTSP, etc.), let me know. Otherwise, I've got this.

By this time, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to Moonfire API. That allows me to trigger a problem when a camera has not any recording for 5 minutes.

Neat! I don't know Zabbix, but in general: I'm open to adding support for white-box monitoring to Moonfire itself also. We can discuss more in a separate issue. I've been meaning for a while to export metrics in the Prometheus exposition format. It looks like Zabbix supports this format also.

scottlamb commented 2 years ago

Actually, come to think of it, even if we did go on to launch the background task, that only tears down sessions with a known id, not these special live555 unknown session id stale sessions. My fix for https://github.com/scottlamb/moonfire-nvr/issues/184 wasn't complete. I started sending TEARDOWN requests for TCP sessions when we know the camera is buggy, which I guess has mostly avoided the problem since for @jlpoolen . But once a stale session with unknown id is noted, we never clean it up. Oops.

Huh. On second (third? fourth? I've lost track) look, I have code that's supposed to do this here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L259-L280

and it seems to be working in my best attempt to reproduce locally (an old Reolink camera with this bug, a hardcoded change to TeardownPolicy::Never, quick restart so there's a bad session to clean up). Huh. I may just have to release a version with more logging to figure this out.

robin-thoni commented 2 years ago

I may just have to release a version with more logging to figure this out.

Simply push a docker image on a different name, I'll run and try it until it fails. As you can see, it can take 1 to 2 weeks.

I'm open to adding support for white-box monitoring to Moonfire itself also

Nice to hear that! I'll have a deeper look then.

scottlamb commented 2 years ago

Oh! Moonfire NVR 0.7.1 is older than I realized. It predates my fix for #184. I should have been looking at Retina version v0.3.4 rather than latest when trying to figure this out. That explains why it didn't recover after 65 seconds. If you upgrade to the latest Moonfire NVR (note you'll need to adapt your nvr script and create a /etc/moonfire-nvr.toml file), this behavior will be better.

The other items in my checklist still apply.

robin-thoni commented 2 years ago

I'll do the upgrade (tomorrow?) and see if it helps.

BTW, would you mind making use of Github releases? That makes it much easier to track new releases.

scottlamb commented 2 years ago

I've done most of the coding in Retina on these checklist items (but haven't prepared either a Retina or Moonfire release yet). In the process of testing the changes in Moonfire NVR I noticed a pre-existing panic on shutdown, see #206.

Github releases: sounds reasonable. Filed #210.

robin-thoni commented 2 years ago

I've updated both of my installations to 0.7.3. I'll update when/if anything bad happen w/ it.

scottlamb commented 2 years ago

Newest commit addresses each of the checklist item. I plan to make a new release after fixing #206 and possibly #217.

Regarding this item:

start trying to avoid this situation by sending a TEARDOWN even when using TCP. I already do this for the live555 buggy cameras. Maybe we should do that explicitly for however your camera advertises itself, or always make at least one quick try before dropping the TCP session, and/or add a pull-down in the config UI for TeardownPolicy. (right now Moonfire always uses Auto)

I adjusted the Auto policy to make that one quick try; more detail here. If you take a look at your camera's tool (which will be exposed through the nvr config's "Test" button in the next release), I might tweak it to make repeated attempts as it does for known-bad servers, depending on how distinct the attribute value seems. (If it doesn't advertise a tool at all, this won't work out of course.) I haven't added that pull-down to Moonfire but still could.

scottlamb commented 2 years ago

v0.7.4 released.

robin-thoni commented 2 years ago

I just updated to 0.7.4. Before that, 0.7.3 has successfully recorded for 3 weeks in a row without any noticeable issue.

This is what it shows:

Thanks for your hard work!