NACK storm - Githubissues

ploxiln commented 9 years ago

I have reason to believe that janus produced 58 megabytes of log messages about sending NACKs for srtp packets in about 2 minutes. Here's a very short excerpt:

[1977125745] Missed sequence number 65526, going to NACK it...
[1977125745] Missed sequence number 65527, going to NACK it...
[1977125745] Missed sequence number 65528, going to NACK it...
[1977125745] Missed sequence number 65529, going to NACK it...
[33m[WARN][0m Skipping PID to NACK (64548-64531 > 17)...
[33m[WARN][0m Skipping PID to NACK (64549-64531 > 18)...
[33m[WARN][0m Skipping PID to NACK (64550-64531 > 19)...
[33m[WARN][0m Skipping PID to NACK (64551-64531 > 20)...

Here's a longer gist of the beginning and end of the 58MB log region, with a few thousand lines so you can see the repeating pattern: https://gist.github.com/ploxiln/00760f38a6122238cb17

There are literally ~ 900,000 nack log messages in between those segments, which occurred within a 2 minute time period (though it could have been in less time than that, I don't know, I base this on my log-rolling configuration).

I can't say what symptoms it had at the time, or what behavior may have triggered it. So, I realize this is not a particularly useful issue report. I only decided to make note of it here, to perhaps inspire some sort of limit to be applied to the NACK queue behavior.

I can say it happened at least twice, the version of janus running was commit 5954a460fe3c9c3fdad326a7f3d1e02aa9bc4a8f from january 29th, it was using a self-installed libnice-0.1.7, on ubuntu-14.04.

lminiero commented 9 years ago

The report of this scenario is probably overly verbose, and prone to clog the console log for handles that suffer from network issues. What's triggering this is missing packets on the client side: more specifically, if Janus doesn't receive one or more packets from a client (and it realizes this looking at sequence numbers, which are sequential) it sends a NACK request to the client to request the packet(s) again. So if Janus is sending a lot of NACKs, it means those packets were indeed not received (whether the client didn't send them or the network dropped them is not known) and so a NACK is due, or otherwise whoever is supposed to receive those frames will get a poor experience.

I guess that, if the main problem is the excessive verbosity, a less verbose approach might be a summary every second, e.g., "Lost X packets and NACKed them"; the more verbose approach we have now may be moved to LOG_HUGE in case it's important to know what was missed.

ploxiln commented 9 years ago

Well, I was thinking, it probably doesn't make sense to send so many NACKs in the first place. Like, if 100 packets in a row need to be NACKed, it's probably best to give up on them. Or, if a packet has been NACKed twice already, it's probably best to give up on it.

nowylie commented 9 years ago

You're probably right. There should be a bandwidth limit on the RTCP data being exchanged between peers, though I don't know to what degree any of the current WebRTC implementations are doing this, and generally only important packets would be NACKed.

My understanding was that the packet loss reporting was more a tool for adapting the encoder settings to meet network conditions, and less about re-transmitting lost packets. Janus probably hasn't really been exposed to this too much yet as it seems to apply more to video streams, and I don't believe any of the plugins actually generate video data.

lminiero commented 9 years ago

One would think that just NACKing important packets would be enough (e.g., just packets belonging to a key frame in the video) but unfortunately that's not the case. We noticed in recent versions of Chrome that a missing video packet would easily lead to freezes in the video, whether it was in a key frame or not. Not sure whether it was the browser decoder or the video engine that became less tolerant, but that's what we experienced, and what eventually lead us to generate NACKs whenever a missing frame was detected.

I guess this can in part be optimized, but unfortunately there's no way around it. In principle many NACKs should suggest a poor connectivity between the peers, and as a consequence result in a lower bitrate too, but since Janus does not generate any video that's not easily viable. In part this may be mitigated by informing plugins that the connectivity is poor somehow, so that they can generate lower bitrate REMBs on either side of the communication to instruct the actual sources to slow down.

nowylie commented 9 years ago

This seems like an issue with Chrome to me. Surely the media playback should be at least partially resistant to packet loss?

lminiero commented 9 years ago

It used to be but isn't anymore. Not sure how Firefox behaves. Anyway the NACKs largely helped Janus cope with this: I agree the logging should be much less "aggressive" though. I'll have to think about other ways to optimize its behaviour, both in terms of NACKs triggering and logging.

lminiero commented 9 years ago

This commit attempts to fix the NACK behaviour a bit.

First of all, it never sends more than two aggregated NACK messages per seconds. This may or may not be the best behaviour, as NACKs mostly make sense when they're immediate. Anyway, this should limit the traffic generated because of them. Besides, now there's no limit on the sequence numbers to NACK in an aggregated feedback: before it was 16+1 (because I was lazy :-) ), now the message is extended as it should with new blocks if more sequence numbers need to be NACKed.

In general, the verbosity has been limited as well. It's still brutal when logging is set to HUGE (but it's to be expected there), while for VERB it's much less annoying. Let me know if you think it's better this way.

ploxiln commented 9 years ago

Thanks. I've deployed the updated version. I should be able to confirm the "problem" is gone in a few days. This is what the cpu usage for my janus server has been recently:

janus-cpu-week

(I suspect that's the NACK busy loop taking the entirety of one or two cores sometimes)

ploxiln commented 9 years ago

actually, that cpu usage may have also been caused by many megs of

^[[31m[ERR]^[[0m [ice.c:janus_ice_cb_nice_recv:998:] [1405973092]     SRTP unprotect error: err_status_replay_old (len=744-->744, ts=722476453, seq=40073)
^[[31m[ERR]^[[0m [ice.c:janus_ice_cb_nice_recv:998:] [1405973092]     SRTP unprotect error: err_status_replay_old (len=743-->743, ts=722476453, seq=40074)
^[[31m[ERR]^[[0m [ice.c:janus_ice_cb_nice_recv:998:] [1405973092]     SRTP unprotect error: err_status_replay_old (len=720-->720, ts=722479243, seq=40075)
^[[31m[ERR]^[[0m [ice.c:janus_ice_cb_nice_recv:998:] [1405973092]     SRTP unprotect error: err_status_replay_old (len=719-->719, ts=722479243, seq=40076)

lminiero commented 9 years ago

Those CPU spikes are anomalous, we didn't experience anything like that. Have you checked with something like htop to see which PIDs of the application were affected, and attached them via gdb to verify what part of the code caused this?

The unprotect issue may be caused by many NACKs, even though they shouldn't because we re-send already encrypted/authenticated packets through libnice, and don't feed them to SRTP again... I'll have to check if there's any issue there. Have things improved with the latest changes?

ploxiln commented 9 years ago

I think you're right, the high cpu usage is not caused by the NACK storm (in fact it may be the inverse, janus starved of cpu time caused it to not receive a bunch of packets it otherwise would have). So the strange cpu usage of janus is another issue. (I'm sure it's janus, but not sure what it's threads are doing...)

lminiero commented 9 years ago

Noticed today that the NACK behaviour seems to be more broken than I thought... hope I'll be able to fix this by today.

lminiero commented 9 years ago

Just FYI, I pushed a commit that should fix it, and seems to be working fine for me, so in case you want to test it please get the latest head first: https://github.com/meetecho/janus-gateway/commit/1e2f5d89baeb7aee4cb34854347555b9c0f33944

ploxiln commented 9 years ago

I've been running that version for a few days, haven't seen crazy runs of NACKs or overly bad NACK behavior.

Now, more than half of the log messages are:

^[[33m[WARN]^[[0m [2088551292]     Missing valid SRTP session (packet arrived too early?), skipping...

(that was 130k out of 220k lines in one of my recent rolling log files)

But that's really not the same thing, and not what this issue was about.

meetecho / janus-gateway

NACK storm #150