systemcrash closed 4 years ago
"it's just immediately dispatched for scraping. Is this correct?" Yes. You can deal with stuck gauges in Prometheus; this is not really a heplify-server issue. Please check the Prometheus docs.
"not really a heplify-server issue" - WAT?
But if I restart heplify-server, the value returns to 0, and the Gauge is 'unstuck'.
This has nothing to do with prometheus if you think about it for a moment.
Try a query like this:
clamp_max(avg_over_time(heplify_rtcp_jitter{node_id=~"sip.*"}[$__interval]),100) and changes(heplify_rtcp_jitter{node_id=~"sip.*"}[$__interval]) > 0
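To make the effect of that query concrete, here is a minimal Python sketch of what it computes for one series over one range window. It only mimics the PromQL functions `avg_over_time`, `clamp_max` and `changes` on a plain list of samples. The helper name `masked_jitter` and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def masked_jitter(window: Sequence[float], cap: float = 100.0) -> Optional[float]:
    """Rough emulation of the query above for one series and one range window.

    Returns the capped average if the gauge changed at least once inside the
    window, otherwise None (the series is dropped from the result).
    """
    if not window:
        return None
    # changes(): how many times a sample differs from the one before it.
    changes = sum(1 for prev, cur in zip(window, window[1:]) if cur != prev)
    if changes == 0:
        return None  # a stuck gauge yields no data point
    # clamp_max(avg_over_time(...), cap)
    return min(mean(window), cap)


print(masked_jitter([309.0, 309.0, 309.0, 309.0]))  # stuck gauge: None
print(masked_jitter([10.0, 20.0, 30.0]))            # live gauge: 20.0
```

One side effect worth noting: a call whose jitter really is constant would be hidden as well, because `changes` cannot tell "no update" apart from "same value again".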
Please try to avoid stealing my invaluable time with useless discussions ;) Thanks!
I have a suggestion for this case:
https://tools.ietf.org/html/rfc1889#section-6.2.1 https://tools.ietf.org/html/rfc3550#section-6.2.1
A participant may mark another site inactive, or delete it if not yet
valid, if no RTP or RTCP packet has been received for a small number
of RTCP report intervals (5 is suggested). This provides some
robustness against packet loss. All sites must calculate roughly the
same value for the RTCP report interval in order for this timeout to
work properly.
Once a site has been validated, then if it is later marked inactive
the state for that site should still be retained and the site should
continue to be counted in the total number of sites sharing RTCP
bandwidth for a period long enough to span typical network
partitions. This is to avoid excessive traffic, when the partition
heals, due to an RTCP report interval that is too small. A timeout of
30 minutes is suggested. Note that this is still larger than 5 times
the largest value to which the RTCP report interval is expected to
usefully scale, about 2 to 5 minutes.
This suggests that, in the absence of calls or RTCP reports, the value can safely be reset to 0 after 30 minutes.
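As a rough illustration of that idea, the sketch below keeps the time of the last update next to each value and reports 0 once the timeout has passed. It is written in Python for brevity and is not heplify-server code. The class `ExpiringGauge`, its methods and the label `sip1` are hypothetical, and the 30-minute default is simply the figure from the RFC text quoted above.

```python
import time
from typing import Callable, Dict, Tuple

# 30 minutes, the inactivity timeout suggested in RFC 3550, section 6.2.1.
INACTIVITY_TIMEOUT = 30 * 60  # seconds


class ExpiringGauge:
    """Per-label gauge that reads as 0 when no update arrived within the timeout."""

    def __init__(self, timeout: float = INACTIVITY_TIMEOUT,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout
        self._clock = clock
        # label -> (last value, time of last update)
        self._values: Dict[str, Tuple[float, float]] = {}

    def set(self, label: str, value: float) -> None:
        # Would be called for every RTCP report carrying a jitter value.
        self._values[label] = (value, self._clock())

    def get(self, label: str) -> float:
        # Would be called at scrape time; stale or unknown labels read as 0.
        if label not in self._values:
            return 0.0
        value, updated = self._values[label]
        if self._clock() - updated >= self._timeout:
            return 0.0
        return value


# Demo with a fake clock so the timeout can be shown without waiting.
now = [0.0]
gauge = ExpiringGauge(clock=lambda: now[0])
gauge.set("sip1", 309.0)
now[0] = 60.0
print(gauge.get("sip1"))  # 309.0: last report was one minute ago
now[0] = 31 * 60.0
print(gauge.get("sip1"))  # 0.0: nothing received for more than 30 minutes
```

Checking staleness at read time avoids a background timer. A real exporter might prefer to drop the stale series entirely rather than report 0, since 0 is also a valid jitter value.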
I currently have no intention of changing anything in that direction.
If you or your company needs this, you can still decide to become a sponsor: https://github.com/sponsors/negbie/
More specifically, how should it be calculated? If I read this right, it is not calculated, it's just immediately dispatched for scraping. Is this correct?
The raw values which arrive seem to do something strange: once RTCP JSON packets (jitter values) stop arriving at heplify-server, the jitter value remains where it is: constant, e.g. ~309. In code it is defined as a Gauge:
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
When heplify-server stops receiving RTCP JSON, updates cease. The value is 'stuck' there.
It seems one short call with high jitter can cause it to get stuck 'high'.
This value is what prometheus returns - because that is the value scraped from heplify-server:
It just.... sits there. What can be done?
One thing is certain: if we are continuously receiving RTCP, the value will be continuously changing. But hypothetically were we to stop receiving RTCP in a call scenario where all media stops, the RTCP value should hang there (until receipt of next RTCP packet).
It seems that heplify-server needs to set this to 0 when it receives the BYE responsible for that RTCP Call-ID (and no other calls are ongoing). Session timers and time-outs would normally tear a call down even without a BYE (the rarer case), so after the negotiated time-out in SIP this value should be set to 0. When the BYE arrives for the only active call, it should be set to 0 immediately.
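The proposed behaviour could look roughly like the sketch below. It is written in Python for brevity and is not heplify-server code: the class `JitterTracker` and its method names are hypothetical, and it assumes the server can match a BYE (or a session time-out) to the Call-ID that the RTCP reports belong to.

```python
from typing import Set


class JitterTracker:
    """Holds the last jitter value and resets it once no call is active."""

    def __init__(self) -> None:
        self.jitter = 0.0
        self.active_calls: Set[str] = set()

    def on_rtcp(self, call_id: str, jitter: float) -> None:
        # An RTCP report marks the call as active and updates the gauge.
        self.active_calls.add(call_id)
        self.jitter = jitter

    def on_call_end(self, call_id: str) -> None:
        # Triggered by a BYE, or by the SIP session time-out if no BYE is seen.
        self.active_calls.discard(call_id)
        if not self.active_calls:
            self.jitter = 0.0  # last call is gone, so unstick the gauge


tracker = JitterTracker()
tracker.on_rtcp("call-a", 309.0)
tracker.on_rtcp("call-b", 12.0)
tracker.on_call_end("call-a")
print(tracker.jitter)  # 12.0: call-b is still active
tracker.on_call_end("call-b")
print(tracker.jitter)  # 0.0: no active calls remain
```

With several concurrent calls the single value still reflects whichever call reported last, so this only addresses the stuck-after-last-call case described here.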