Closed robin-thoni closed 2 years ago
Are you using TCP or UDP to talk with the cameras? (The default is TCP, so it's TCP unless you picked UDP within the camera setup dialog.)
sleeping for PT1S after error: stream ended
This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.
sleeping for PT1S after error: [172.16.12.2:33708(me)->10.15.9.1:554@2022-03-28T20:09:39, 1302@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=0000b8c6] Wrong ssrc; expecting ssrc=Some(76bbcb4e) seq=Some(b8c6)
I might add some more debug logging to narrow this down. Not sure how familiar you are with RTSP, but it's layered on top of another protocol, RTP. RTP packets all have an ssrc that more or less identifies the stream. Moonfire was expecting one ssrc (either because it was promised in an RTSP header or because a previous packet used that ssrc; extra debug logging could tell us which) and got a packet for a different one.
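To make that concrete, here's a minimal sketch (illustrative only, not Moonfire's or Retina's actual code) of where the ssrc lives in an RTP packet and how a receiver can detect the mismatch; per RFC 3550, the SSRC occupies bytes 8-11 of the fixed header:

```rust
/// Extract the SSRC from the fixed 12-byte RTP header (RFC 3550):
/// it sits in bytes 8..12, big-endian.
fn rtp_ssrc(packet: &[u8]) -> Option<u32> {
    if packet.len() < 12 {
        return None; // too short to be a valid RTP packet
    }
    Some(u32::from_be_bytes([packet[8], packet[9], packet[10], packet[11]]))
}

/// A receiver that latches onto the first SSRC it sees and rejects others,
/// mirroring the "Wrong ssrc" check described above.
struct SsrcCheck {
    expected: Option<u32>,
}

impl SsrcCheck {
    fn accept(&mut self, packet: &[u8]) -> Result<(), String> {
        let ssrc = rtp_ssrc(packet).ok_or("short packet")?;
        match self.expected {
            None => {
                // First packet fixes the stream identity.
                self.expected = Some(ssrc);
                Ok(())
            }
            Some(e) if e == ssrc => Ok(()),
            Some(e) => Err(format!("Wrong ssrc; expecting {:08x}, got {:08x}", e, ssrc)),
        }
    }
}
```

A packet arriving with a different SSRC than the promised/latched one produces exactly the kind of error quoted in the log above.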
When this happens, Moonfire suspects it's due to a bug in old versions of a popular RTSP server implementation called live555. In short, the server doesn't tear down TCP sessions properly, and the camera can end up sending several of them over the same connection, which cascades into overload problems and then another bug that corrupts the connection after the camera's kernel TCP send buffer fills. There's a whole saga about this at https://github.com/scottlamb/retina/issues/17 .
Moonfire tries to avoid this cascade by waiting for the old session to go away before trying again, which is why you see this message:
waiting up to 63.999408706s for TEARDOWN or expiration of 1 stale sessions
What happens after that? There should be another attempt around 20:10:44.
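As an aside on where a figure like 63.999408706s can come from: my guess (an assumption, not confirmed against the code) is that the wait is the stale session's advertised timeout plus a grace period, minus the time already elapsed since the session was last known alive. In sketch form:

```rust
use std::time::Duration;

/// Assumed behavior, not Moonfire's exact code: how long to keep waiting
/// for a stale RTSP session to expire server-side. Servers advertise a
/// session timeout (e.g. "Session: ...;timeout=60"); waiting that long
/// plus a small grace period, minus the time already elapsed, gives the
/// remaining wait.
fn remaining_wait(session_timeout: Duration, grace: Duration, elapsed: Duration) -> Duration {
    (session_timeout + grace).saturating_sub(elapsed)
}
```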
I'm using TCP explicitly in the config
What happens after that? There should be another attempt around 20:10:44.
Nothing, that's the end of the logs
This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.
If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?
Nothing, that's the end of the logs
😞
It kept running but didn't log any more? Was this the only camera Moonfire was configured to record?
If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?
No, that part is definitely broken then.
This sounds similar to #184, which I'd thought I'd fixed. I'll take another look.
Reread through the code in question, nothing's jumping out at me yet.
If you don't mind, it might help to enable more logging. Where there are --env lines in your nvr wrapper script, you can add one like --env=MOONFIRE_LOG=info,moonfire_nvr=debug,retina=debug, then do a nvr stop; nvr rm; nvr run to make it take effect. When a connection is dropped, there should then be a sequence of debug log lines from retina::client::teardown that narrow down the problem.
Yeah, it kept running and recording the second camera. Just no recordings/logs for that one anymore.
I just enabled the logs, and I'm now getting logs like this:
moonfire_1 | D20220401 10:23:19.197 tokio-runtime-worker retina::client] SETUP response: Response {
moonfire_1 | version: V1_0,
moonfire_1 | status: Ok,
moonfire_1 | reason_phrase: "OK",
moonfire_1 | headers: Headers(
moonfire_1 | {
moonfire_1 | HeaderName(
moonfire_1 | "CSeq",
moonfire_1 | ): HeaderValue(
moonfire_1 | "3",
moonfire_1 | ),
moonfire_1 | HeaderName(
moonfire_1 | "Date",
moonfire_1 | ): HeaderValue(
moonfire_1 | "Fri, Apr 01 2022 10:23:19 GMT",
moonfire_1 | ),
moonfire_1 | HeaderName(
moonfire_1 | "Session",
moonfire_1 | ): HeaderValue(
moonfire_1 | "1353878211;timeout=60",
moonfire_1 | ),
moonfire_1 | HeaderName(
moonfire_1 | "Transport",
moonfire_1 | ): HeaderValue(
moonfire_1 | "RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=32887ea3;mode=\"play\"",
moonfire_1 | ),
moonfire_1 | },
moonfire_1 | ),
moonfire_1 | body: b"",
moonfire_1 | }
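Note the ssrc=32887ea3 parameter in that Transport header: that's the promised ssrc mentioned earlier. A hypothetical helper (not Retina's actual API) to pull it out of the header value might look like:

```rust
/// Hypothetical helper (not Retina's actual API): extract the ssrc
/// parameter from an RTSP Transport header value such as
/// "RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=32887ea3;mode=\"play\"".
/// Parameters are semicolon-separated; ssrc is hex per RFC 2326 usage.
fn transport_ssrc(transport: &str) -> Option<u32> {
    transport
        .split(';')
        .find_map(|param| param.trim().strip_prefix("ssrc="))
        .and_then(|hex| u32::from_str_radix(hex, 16).ok())
}
```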
Is that what you're looking for? If so, I'll send you everything once it happens again.
Ah, second problem: the second camera has actually stopped recording since I restarted Moonfire on March 19th:
As you can see, there's a 0/1-second recording, then nothing. I've restarted multiple times just now, but the recording won't restart. The camera is working/streaming fine outside of Moonfire.
I attached the logs since I restarted a few minutes ago.
Probably a different bug, though?
Ahh! Got it. Those logs helped, and I believe both cameras' problems are due to the same bug.
From this log line:
W20220401 10:23:19.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: sleeping for PT1S after error: [172.16.12.2:54278(me)->10.15.9.2:554@2022-04-01T10:23:19, 1302@2022-04-01T10:23:19 ch=0, stream=0, ssrc=32887ea3, seq=000090b1] Wrong ssrc; expecting ssrc=Some(4c29bf0c) seq=Some(90b1)
I see that it reached this code here:
in which it sees a strange ssrc and assumes it's due to a stale session left behind by the live555 TCP bug, as described here:
And we see that Moonfire is waiting for this session to go away:
I20220401 10:23:20.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: waiting up to 63.999485425s for TEARDOWN or expiration of 1 stale sessions
but it never does [edit: removed confused ideas based on looking at the wrong version of the code] because you're running Moonfire v0.7.1, which predates my fix to #184.
Also, your camera isn't advertising itself as using a buggy version of live555 (even though it is using such a version, or some other software with a similar bug), or we'd have seen this line logged:
Anyway, I think there are several changes to make to improve this situation:

* v0.7.2 and beyond.
* Surface the tool attribute from the SDP in a log message and/or the result of the Test button in the camera UI, so we can see what it says. Your camera is buggy, and it might be good to find out how it advertises itself, as we do for other buggy live555 cameras. (You could also grab this yourself via ffmpeg or a packet capture.)
* Start trying to avoid this situation by sending a TEARDOWN even when using TCP. I already do this for the live555 buggy cameras. Maybe we should do that explicitly for however your camera advertises itself, or always make at least one quick try before dropping the TCP session, and/or add a pull-down in the config UI for TeardownPolicy. (Right now Moonfire always uses Auto.)

In the meantime, a couple of ways you can work around this problem:
Ah, glad it helped!
Thanks for taking the time to chase that bug! Unfortunately, I have little to no knowledge to help you fix it...
In the meantime, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to the Moonfire API. That allows me to trigger a problem when a camera hasn't had any recording for 5 minutes.
Hi Robin, what is a "PoC"? And could you expound upon your implementation of an alert system? Perhaps through the mailing list for this project? Thank you, John
Hi John, I meant Proof of Concept. It's a very simple gateway that authenticates to Moonfire and fetches each camera's recordings with their startDate, endDate, and duration, basically.
The second part is a Zabbix template to connect Zabbix to it. That's the end result:
I plan to release it shortly
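For the curious, the "no recording for 5 minutes" trigger reduces to comparing the newest recording's end time against the clock. A sketch under assumed types (the Recording struct and its fields are made up for illustration; Moonfire's JSON API has its own shape):

```rust
use std::time::{Duration, SystemTime};

/// Minimal stand-in for one recording as a monitoring gateway might see it
/// (hypothetical shape; Moonfire's actual API differs).
#[derive(Clone, Copy)]
struct Recording {
    start: SystemTime,
    end: SystemTime,
}

impl Recording {
    fn duration(&self) -> Duration {
        self.end.duration_since(self.start).unwrap_or(Duration::ZERO)
    }
}

/// Trigger condition: true when the camera has no recording ending within
/// the last `window` (e.g. 5 minutes). Recordings with end times in the
/// future are ignored for simplicity.
fn is_stale(recordings: &[Recording], now: SystemTime, window: Duration) -> bool {
    recordings
        .iter()
        .filter_map(|r| now.duration_since(r.end).ok()) // age of each recording
        .min()
        .map_or(true, |newest_age| newest_age > window)
}
```

A gateway can evaluate this per camera after polling the API, then report the boolean (or the raw age) to the monitoring system.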
Unfortunately, I have little to no knowledge to help you fix it...
No worries. If you'd like to learn more (about Rust, RTSP, etc.), let me know. Otherwise, I've got this.
In the meantime, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to the Moonfire API. That allows me to trigger a problem when a camera hasn't had any recording for 5 minutes.
Neat! I don't know Zabbix, but in general: I'm open to adding support for white-box monitoring to Moonfire itself also. We can discuss more in a separate issue. I've been meaning for a while to export metrics in the Prometheus exposition format. It looks like Zabbix supports this format also.
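For reference, the Prometheus text exposition format mentioned above is simple enough to emit by hand. A sketch with a made-up metric name (Moonfire doesn't actually export this):

```rust
/// Render one gauge per camera in the Prometheus text exposition format.
/// The metric and label names here are invented for illustration.
fn render_metrics(cameras: &[(&str, u64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP moonfire_last_recording_age_seconds Seconds since the camera's newest recording ended.\n");
    out.push_str("# TYPE moonfire_last_recording_age_seconds gauge\n");
    for (name, age_seconds) in cameras {
        // One sample line per label set: name{labels} value
        out.push_str(&format!(
            "moonfire_last_recording_age_seconds{{camera=\"{}\"}} {}\n",
            name, age_seconds
        ));
    }
    out
}
```

Anything that can serve this text over HTTP can be scraped by Prometheus, and Zabbix's Prometheus support can parse the same output.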
Actually, come to think of it, even if we did go on to launch the background task, that only tears down sessions with a known id, not these special live555 unknown-session-id stale sessions. My fix for https://github.com/scottlamb/moonfire-nvr/issues/184 wasn't complete. I started sending TEARDOWN requests for TCP sessions when we know the camera is buggy, which I guess has mostly avoided the problem for @jlpoolen since then. But once a stale session with an unknown id is noted, we never clean it up. Oops.
Huh. On second (third? fourth? I've lost track) look, I have code that's supposed to do this here:
and it seems to be working in my best attempt to reproduce locally (an old Reolink camera with this bug, a hardcoded change to TeardownPolicy::Never, and a quick restart so there's a bad session to clean up). Huh. I may just have to release a version with more logging to figure this out.
I may just have to release a version with more logging to figure this out.
Simply push a docker image under a different name, and I'll run it until it fails. As you can see, it can take 1 to 2 weeks.
I'm open to adding support for white-box monitoring to Moonfire itself also
Nice to hear that! I'll have a deeper look then.
Oh! Moonfire NVR 0.7.1 is older than I realized. It predates my fix for #184. I should have been looking at Retina version v0.3.4 rather than latest when trying to figure this out. That explains why it didn't recover after 65 seconds. If you upgrade to the latest Moonfire NVR (note you'll need to adapt your nvr script and create a /etc/moonfire-nvr.toml file), this behavior will be better.
The other items in my checklist still apply.
I'll do the upgrade (tomorrow?) and see if it helps.
BTW, would you mind making use of Github releases? That makes it much easier to track new releases.
I've done most of the coding in Retina on these checklist items (but haven't prepared either a Retina or Moonfire release yet). In the process of testing the changes in Moonfire NVR I noticed a pre-existing panic on shutdown, see #206.
Github releases: sounds reasonable. Filed #210.
I've updated both of my installations to 0.7.3. I'll report back when/if anything bad happens with it.
The newest commit addresses each of the checklist items. I plan to make a new release after fixing #206 and possibly #217.
Regarding this item:
start trying to avoid this situation by sending a TEARDOWN even when using TCP. I already do this for the live555 buggy cameras. Maybe we should do that explicitly for however your camera advertises itself, or always make at least one quick try before dropping the TCP session, and/or add a pull-down in the config UI for TeardownPolicy. (right now Moonfire always uses Auto)
I adjusted the Auto policy to make that one quick try; more detail here. If you take a look at your camera's tool attribute (which will be exposed through the nvr config's "Test" button in the next release), I might tweak it to make repeated attempts as it does for known-bad servers, depending on how distinct the attribute value seems. (If it doesn't advertise a tool at all, this won't work, of course.) I haven't added that pull-down to Moonfire but still could.
v0.7.4 released.
I just updated to 0.7.4. Before that, 0.7.3 has successfully recorded for 3 weeks in a row without any noticeable issue.
This is what it shows:
Trendnet TV-IP311PI:
Trendnet TV-IP315PI:
Thanks for your hard work!
Describe the bug One of my cameras (the same as in #197) stops recording after a few days with the following logs:
Expected behavior The camera should record indefinitely, or restart if the stream dies for some reason.
Screenshots
Server (please complete the following information):
Camera (please complete the following information):
Model: Trendnet TV-IP311PI
Firmware Version: V5.0.1 140303