sipcapture / heplify-server

HEP Capture Server for HOMER
https://sipcapture.org
GNU Affero General Public License v3.0

How is heplify_rtcp_jitter currently calculated? #404

Closed systemcrash closed 4 years ago

systemcrash commented 4 years ago

More specifically, how should it be calculated? If I read this right, it is not calculated, it's just immediately dispatched for scraping. Is this correct?

The raw values that arrive behave strangely: once RTCP JSON packets (jitter values) stop arriving at heplify-server, the jitter value stays where it is, constant at e.g. ~309.

(Image: Screenshot 2020-06-12 at 18 29 31)

In the code it is defined as a Gauge. From the Prometheus docs: "A gauge is a metric that represents a single numerical value that can arbitrarily go up and down."

When heplify-server stops receiving RTCP JSON, updates cease. The value is 'stuck' there.

It seems one short call with high jitter can cause it to get stuck 'high'.

This value is what prometheus returns - because that is the value scraped from heplify-server:

# HELP heplify_rtcp_jitter RTCP jitter
# TYPE heplify_rtcp_jitter gauge
heplify_rtcp_jitter{node_id="1"} 309

It just... sits there. What can be done?

One thing is certain: as long as we are continuously receiving RTCP, the value will keep changing. But if, hypothetically, we stop receiving RTCP during a call in which all media stops, the jitter value should stay where it is (until the next RTCP packet is received).

It seems that heplify-server needs to set this to 0 when it receives the BYE for the Call-ID that the RTCP belongs to (and no other calls are ongoing). In the rarer case where no BYE arrives, session timers and timeouts would normally tear the call down anyway, so after the negotiated SIP timeout the value should also be set to 0.

When the BYE for the only active call arrives, the value should be set to 0 immediately.
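
The reset-on-BYE idea could be sketched roughly as below. This is not heplify-server code: the jitterTracker type, its method names, and the plain float64 field standing in for the Prometheus gauge are all hypothetical, and the sketch assumes every RTCP report and every BYE can be matched to a Call-ID.

```go
package main

import "sync"

// jitterTracker is a hypothetical stand-in for the heplify_rtcp_jitter gauge
// plus the bookkeeping needed to know whether any call is still active.
type jitterTracker struct {
	mu     sync.Mutex
	active map[string]struct{} // Call-IDs that sent RTCP and have not seen a BYE yet
	value  float64             // what a scrape would return
}

// newJitterTracker returns a tracker with no active calls and a value of 0.
func newJitterTracker() *jitterTracker {
	return &jitterTracker{active: make(map[string]struct{})}
}

// OnRTCP marks callID as active and exposes the reported jitter.
func (t *jitterTracker) OnRTCP(callID string, jitter float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active[callID] = struct{}{}
	t.value = jitter
}

// OnBYE forgets callID and resets the value to 0 once no calls remain.
func (t *jitterTracker) OnBYE(callID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, callID)
	if len(t.active) == 0 {
		t.value = 0
	}
}

// Value returns the value a scrape would currently see.
func (t *jitterTracker) Value() float64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.value
}
```

With two calls active, a BYE for one of them leaves the last reported value in place; only the BYE for the last remaining call drops it to 0. Calls that end without a BYE are not handled here and would still need a timeout.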

negbie commented 4 years ago

"it's just immediately dispatched for scraping. Is this correct?" Yes. You can deal with stuck gauges in Prometheus; this is not really a heplify-server issue. Please check the Prometheus docs.

systemcrash commented 4 years ago

"not really a heplify-server issue" - WAT?

But if I restart heplify-server the value returns to 0, and the Gauge is 'unstuck'.

This has nothing to do with prometheus if you think about it for a moment.

negbie commented 4 years ago

Try a query like this

clamp_max(avg_over_time(heplify_rtcp_jitter{node_id=~"sip.*"}[$__interval]),100) and changes(heplify_rtcp_jitter{node_id=~"sip.*"}[$__interval]) > 0
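
For readers less familiar with PromQL: avg_over_time averages the samples in the window, clamp_max caps the result at 100, changes counts how often the value changed within the window, and the "and ... > 0" part keeps a series only where that count is above zero ($__interval is a Grafana variable for the window length). A rough Go sketch of the same logic for a single series over a single window follows; the function name and signature are made up for illustration and this is not how Prometheus evaluates the query internally.

```go
package main

// filteredJitter mimics the query for one series and one window of samples:
// it returns the average capped at limit, and false when the window is empty
// or the value never changed (the "stuck gauge" case, which is dropped).
func filteredJitter(samples []float64, limit float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	changes := 0
	sum := samples[0]
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			changes++
		}
		sum += samples[i]
	}
	if changes == 0 {
		return 0, false // nothing changed in this window: no data point
	}
	avg := sum / float64(len(samples))
	if avg > limit {
		avg = limit
	}
	return avg, true
}
```

A window holding only the stuck value, such as 309, 309, 309, yields no data point at all rather than a flat line.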

negbie commented 4 years ago

Please try to avoid stealing my invaluable time with useless discussions ;) Thanks!

systemcrash commented 4 years ago

I have a suggestion for this case:

https://tools.ietf.org/html/rfc1889#section-6.2.1
https://tools.ietf.org/html/rfc3550#section-6.2.1

   A participant may mark another site inactive, or delete it if not yet
   valid, if no RTP or RTCP packet has been received for a small number
   of RTCP report intervals (5 is suggested). This provides some
   robustness against packet loss. All sites must calculate roughly the
   same value for the RTCP report interval in order for this timeout to
   work properly.

   Once a site has been validated, then if it is later marked inactive
   the state for that site should still be retained and the site should
   continue to be counted in the total number of sites sharing RTCP
   bandwidth for a period long enough to span typical network
   partitions.  This is to avoid excessive traffic, when the partition
   heals, due to an RTCP report interval that is too small. A timeout of
   30 minutes is suggested. Note that this is still larger than 5 times
   the largest value to which the RTCP report interval is expected to
   usefully scale, about 2 to 5 minutes.

So in the absence of calls or RTCP reports, the value can safely be reset to 0 after 30 minutes.
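
A minimal sketch of such a timeout, assuming a single jitter value and a caller that supplies the current time, might look like this. The expiringJitter type and its methods are hypothetical and not part of heplify-server; the 30 minutes come from the RFC text quoted above, and the sketch does no locking, so concurrent use would need a mutex.

```go
package main

import "time"

// staleAfter is the inactivity timeout, taken from the 30 minutes suggested
// in the RFC text quoted above.
const staleAfter = 30 * time.Minute

// expiringJitter is a hypothetical wrapper that remembers when the last
// RTCP report arrived alongside the reported jitter value.
type expiringJitter struct {
	value    float64
	lastSeen time.Time
}

// Observe stores a new jitter report and the time it arrived.
func (e *expiringJitter) Observe(now time.Time, jitter float64) {
	e.value = jitter
	e.lastSeen = now
}

// ValueAt returns what a scrape at time now would see: the last report if it
// is younger than staleAfter, otherwise 0.
func (e *expiringJitter) ValueAt(now time.Time) float64 {
	if now.Sub(e.lastSeen) >= staleAfter {
		return 0
	}
	return e.value
}
```

Checking staleness at read time avoids a background timer per value; a real exporter would call something like ValueAt from its collect path.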

negbie commented 4 years ago

I currently have no intention of changing anything in that direction.

  1. That's the way gauges work.
  2. You can still use the PromQL changes() function.
  3. Implementing your suggestion in a performant way would take time that I don't have.

negbie commented 4 years ago

If you or your company needs this, you can still decide to become a sponsor: https://github.com/sponsors/negbie/