jitsi / jitsi-videobridge

Jitsi Videobridge is a WebRTC-compatible video router, or SFU, that lets you build highly scalable video conferencing infrastructure (i.e., up to hundreds of conferences per server).
https://jitsi.org/jitsi-videobridge
Apache License 2.0

Transit stats should be reported over a shorter window than "all time" #1720

Open jbg opened 3 years ago

jbg commented 3 years ago

Description

During load testing to determine achievable packet rates on new server configurations, we monitor the stats reported at /debug/stats/jvb/transit-stats, e.g.:

{
  "e2e_packet_delay": {
    "rtp": {
      "average_delay_ms": 0.10656440412295427,
      "max_delay_ms": 5,
      "total_value": 766197,
      "total_count": 7189990,
      "buckets": {
        "<= 2 ms": 7189986,
        "<= 5 ms": 4,
        "<= 20 ms": 0,
        "<= 50 ms": 0,
        "<= 200 ms": 0,
        "<= 500 ms": 0,
        "<= 1000 ms": 0,
        "> 1000 ms": 0,
        "p99<=": 2,
        "p999<=": 2
      }
    },
    "rtcp": {
      "average_delay_ms": 0.09941939843609936,
      "max_delay_ms": 3,
      "total_value": 7209,
      "total_count": 72511,
      "buckets": {
        "<= 2 ms": 72510,
        "<= 5 ms": 1,
        "<= 20 ms": 0,
        "<= 50 ms": 0,
        "<= 200 ms": 0,
        "<= 500 ms": 0,
        "<= 1000 ms": 0,
        "> 1000 ms": 0,
        "p99<=": 2,
        "p999<=": 2
      }
    }
  },
  "overall_bridge_jitter": null
}

Once the offered traffic exceeds the packet rate the server can sustain, average_delay_ms recovers only very slowly afterwards. This is because the statistics are not calculated over a window, so the average includes every data point since the JVB was launched.

This limits the usefulness of this (otherwise hugely useful) statistic. Overall RTP and RTCP delay probably capture bridge performance better than any other single metric, but unless they are calculated over a much shorter time window than "all time" they are of limited use for monitoring, since a short spike in delay can continue to affect the reported value long after it has passed.

For example, this is the output one day after a fairly short load test in our lab during which the server's capacity was deliberately exceeded:

{
  "e2e_packet_delay": {
    "rtp": {
      "average_delay_ms": 399.0452475014625,
      "max_delay_ms": 12255,
      "total_value": 103784824052,
      "total_count": 260082847,
      "buckets": {
        "<= 2 ms": 191403114,
        "<= 5 ms": 12346950,
        "<= 20 ms": 2944048,
        "<= 50 ms": 491312,
        "<= 200 ms": 1439819,
        "<= 500 ms": 3201499,
        "<= 1000 ms": 6932371,
        "> 1000 ms": 41323734,
        "p99<=": -1,
        "p999<=": -1
      }
    },
    "rtcp": {
      "average_delay_ms": 476.17974858540765,
      "max_delay_ms": 11927,
      "total_value": 2379396872,
      "total_count": 4996846,
      "buckets": {
        "<= 2 ms": 3537522,
        "<= 5 ms": 192730,
        "<= 20 ms": 51904,
        "<= 50 ms": 9649,
        "<= 200 ms": 30195,
        "<= 500 ms": 69332,
        "<= 1000 ms": 155200,
        "> 1000 ms": 950314,
        "p99<=": -1,
        "p999<=": -1
      }
    }
  },
  "overall_bridge_jitter": null
}

Actual RTP packet delay has returned to well under 1 ms now that the server is unloaded again, but there is no way to see that from these stats: the values are still dominated by the data points recorded during the load test, and relatively few data points have been recorded since.
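As a rough back-of-the-envelope calculation (assuming average_delay_ms is simply total_value / total_count accumulated since startup, which matches the numbers above: 103784824052 / 260082847 ≈ 399 ms), the bridge would have to forward on the order of 10^11 further packets at near-zero delay, roughly 400 times the number it has seen so far, before the reported average dropped back below 1 ms.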

Current behavior

Transit stats are calculated based on every data point since the JVB was started.

Expected Behavior

Transit stats should be calculated over a shorter time window ending at the present time.

Possible Solution

Implement windowing in org.jitsi.utils.stats.BucketStats. (Note: although that class lives in a different repository than this issue, I am filing the issue here because JVB is where the lack of windowing has a noticeable impact.)
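One possible shape for this (a rough sketch only; the names and structure below are illustrative and not taken from the existing BucketStats code) would be to keep a snapshot of the cumulative counters at the start of each window and report the difference:

```kotlin
import java.util.concurrent.atomic.AtomicLongArray

// Illustrative tumbling-window histogram; not the actual
// org.jitsi.utils.stats.BucketStats implementation.
class WindowedBuckets(
    private val thresholdsMs: LongArray,  // e.g. [2, 5, 20, 50, 200, 500, 1000]
    private val windowMs: Long = 60_000
) {
    // One slot per threshold plus a final "greater than the last threshold" slot.
    private val cumulative = AtomicLongArray(thresholdsMs.size + 1)
    private var snapshot = LongArray(thresholdsMs.size + 1)
    private var windowStart = System.currentTimeMillis()

    fun addDelay(delayMs: Long) {
        val i = thresholdsMs.indexOfFirst { delayMs <= it }
        cumulative.incrementAndGet(if (i >= 0) i else thresholdsMs.size)
    }

    /** Counts seen since the current window began; rolls over to a new window once it expires. */
    @Synchronized
    fun currentWindowCounts(): LongArray {
        val now = LongArray(cumulative.length()) { cumulative.get(it) }
        val delta = LongArray(now.size) { now[it] - snapshot[it] }
        if (System.currentTimeMillis() - windowStart >= windowMs) {
            snapshot = now
            windowStart = System.currentTimeMillis()
        }
        return delta
    }
}
```

A sliding window (e.g. a ring of per-second sub-histograms) would give smoother output, but even a simple tumbling window like this would make the reported numbers reflect recent behaviour rather than all time.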

Steps to reproduce

While monitoring the values reported at /debug/stats/jvb/transit-stats, add traffic to the bridge until it exceeds the server's capacity. Then remove the traffic and observe that the transit stats take a long time to normalise. (If you overload the server by a large margin or for a long time, and the server is otherwise lightly loaded, they may not normalise for days.)

bgrozev commented 3 years ago

Hey @jbg, I understand your use case. We solve the problem with some glue between the bridge and the database: we query transit-stats periodically (once a minute in our case) and subtract the values from the previous run.
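In essence (a simplified sketch, not our actual glue code, and assuming total_value is the sum of per-packet delays in milliseconds, which is what the dumps above suggest), each run just diffs the cumulative counters against the previous scrape:

```kotlin
// Hypothetical helper: turn the cumulative counters from two successive scrapes of
// /debug/stats/jvb/transit-stats into values for the interval between them.
data class TransitSnapshot(val totalValueMs: Long, val totalCount: Long)

fun intervalAverageDelayMs(prev: TransitSnapshot, curr: TransitSnapshot): Double? {
    val packets = curr.totalCount - prev.totalCount
    if (packets <= 0) return null  // no packets forwarded in this interval
    return (curr.totalValueMs - prev.totalValueMs).toDouble() / packets
}
```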

jbg commented 3 years ago

Thanks for the hint. I guess you are just taking the histogram data from buckets and not looking at the average/max then?

bgrozev commented 3 years ago

Correct, we found the max and average not to be very useful. For our monitoring we ended up extracting the percentage of packets delayed more than X ms from the buckets (for X = 5, 50, 500), and that is what we graph.
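For illustration (the bucket keys are the ones from the JSON above; the helpers themselves are hypothetical), the per-interval percentage over a threshold falls straight out of the diffed buckets:

```kotlin
// Hypothetical helpers: diff the "buckets" map from two successive scrapes and
// compute what fraction of the packets seen in between were delayed more than 5 ms.
fun bucketDelta(prev: Map<String, Long>, curr: Map<String, Long>): Map<String, Long> =
    curr.mapValues { (key, count) -> count - (prev[key] ?: 0L) }

fun fractionOver5Ms(delta: Map<String, Long>): Double {
    // "p99<=" / "p999<=" are percentile markers, not counts, so keep only the real buckets.
    val counts = delta.filterKeys { it.startsWith("<=") || it.startsWith(">") }
    val total = counts.values.sum().toDouble()
    if (total == 0.0) return 0.0
    val atMost5Ms = (counts["<= 2 ms"] ?: 0L) + (counts["<= 5 ms"] ?: 0L)
    return (total - atMost5Ms) / total
}
```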

jbg commented 3 years ago

Thanks! We'll set up something similar for our metrics.

How expensive is the transit-stats calculation? The max and average (if windowed) could be nice to expose more 'publicly' on the health endpoint or similar. It would make things easier for people who just want a simple metric to scrape and graph.