newrelic / elixir_agent

New Relic's Open Source Elixir Agent
https://hex.pm/packages/new_relic_agent
Apache License 2.0
254 stars 97 forks source link

Add Sum of Squares calculation #428

Open zadean opened 9 months ago

zadean commented 9 months ago

Calculate "Total Sum of Squares" value for Metrics.

This uses Welford's online algorithm. It is relatively accurate when using 64-bit floating point values, only losing accuracy on float rounding. Here, there is a bit more lost since the floating points are limited in size and held as integers. (Still better than 0 imho 😄)

This also remove the sum_of_squares field from the NewRelic.Metric struct since it isn't needed for calculation and doesn't need to have an extra Access call on it. Also, counters default to 0, so removed the initial add of 0 to 0.

Adding this value to the reported metrics allows the stddev function to be used on the reported metrics. This can be helpful when looking at performance regressions, for example.

Note: This could add a bit of overhead in the extremely hot function it's in! Understandably, this could have negative effects on the overall performance of the agent and reporting metrics on time.

I also noticed that most of the agents don't have this value set, or only set to 0 or 1 or something, so I imagined it's due to the calculation overhead.

My feelings won't be hurt if this doesn't end up in the default branch 💟, but wanted to put this out there for consideration and discussion.

CLAassistant commented 9 months ago

CLA assistant check
All committers have signed the CLA.

CLAassistant commented 9 months ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

tpitale commented 9 months ago

I'm going to wait to merge this until after we finish getting a release working on otp 26. We've just merged that and will release soon.

binaryseed commented 9 months ago

Hello there! Saw this go by and wanted to offer a few considerations:

tpitale commented 9 months ago

Thank you @binaryseed. I will check in with them.

zadean commented 9 months ago

Thanks for the feedback @binaryseed ! I'll try to quantify the performance hit in the meanwhile. I'll also try adding the count and sum (maybe mean) in the acc. That might drop the performance hit a bit. Only calculating the mean once, and not on each time through.

As for if the sum_of_squares is used, I think it is. I deployed this in a test environment and was able to see the value passed through, also able to use stddev in queries without an empty value being returned.

A little context on the "why" of having this.. Testing a canary deploy in prod against the current running state. If the reply time of the top N endpoints is 1/X standard deviations more than the existing code average, the deployment is stopped. Hence the use/need of the NRQL stddev function.

Also, I checked some Java apps metrics, and they seems to have this value set to something. and the same deployment checks that fail for Elixir pass for Java. 😄 I guess setting the value to 1 instead of 0 would also do that... but yeah. Trying to do it right. 🤣

zadean commented 7 months ago

I finally found some time to benchmark this change.

main is the current implementation, new is the change in this PR, newer is an additional change to do a single match-out of the metric to not use the "dot" access. like this:

    %{
      name: name,
      scope: scope,
      call_count: call_count,
      total_call_time: total_call_time,
      total_exclusive_time: total_exclusive_time,
      min_call_time: min_call_time,
      max_call_time: max_call_time
    } = metric

I did this run with 1, 10, 100, 1,000 and 10,000 random metric values.

It seems this change does add some latency. 4:7~ish. but this code is super fast! I had to add fast_warning: false to Benchee to stop the warnings! 😄

If the latency seems to be within reason, I'll add the match-out to this change to get the faster version.

Name                          ips        average  deviation         median         99th %
metrics_newer_1         1786.32 K        0.56 μs   ±113.92%        0.41 μs        4.15 μs
metrics_new_1           1593.68 K        0.63 μs    ±93.03%        0.47 μs        4.18 μs
metrics_main_1          1265.06 K        0.79 μs   ±631.89%        0.60 μs        2.30 μs
metrics_main_10          217.22 K        4.60 μs    ±28.33%        4.40 μs        9.70 μs
metrics_newer_10         142.85 K        7.00 μs   ±253.19%           7 μs          10 μs
metrics_new_10           135.34 K        7.39 μs   ±235.98%           7 μs          11 μs
metrics_main_100          23.22 K       43.06 μs   ±103.05%          42 μs          49 μs
metrics_newer_100         15.00 K       66.68 μs    ±10.12%          66 μs          92 μs
metrics_new_100           14.25 K       70.20 μs    ±14.90%          69 μs          89 μs
metrics_main_1000          2.36 K      423.68 μs     ±4.36%         420 μs         492 μs
metrics_newer_1000         1.48 K      677.77 μs    ±34.73%         662 μs      785.38 μs
metrics_new_1000           1.40 K      714.48 μs     ±6.85%         698 μs      959.52 μs
metrics_main_10000         0.24 K     4239.77 μs    ±17.98%        4190 μs     4551.00 μs
metrics_newer_10000       0.151 K     6643.86 μs     ±1.62%        6621 μs     6978.68 μs
metrics_new_10000         0.141 K     7107.29 μs    ±16.08%        6977 μs     9256.44 μs

Comparison: 
metrics_newer_1         1786.32 K
metrics_new_1           1593.68 K - 1.12x slower +0.0677 μs
metrics_main_1          1265.06 K - 1.41x slower +0.23 μs
metrics_main_10          217.22 K - 8.22x slower +4.04 μs
metrics_newer_10         142.85 K - 12.51x slower +6.44 μs
metrics_new_10           135.34 K - 13.20x slower +6.83 μs
metrics_main_100          23.22 K - 76.92x slower +42.50 μs
metrics_newer_100         15.00 K - 119.11x slower +66.12 μs
metrics_new_100           14.25 K - 125.39x slower +69.64 μs
metrics_main_1000          2.36 K - 756.82x slower +423.12 μs
metrics_newer_1000         1.48 K - 1210.71x slower +677.21 μs
metrics_new_1000           1.40 K - 1276.29x slower +713.92 μs
metrics_main_10000         0.24 K - 7573.58x slower +4239.21 μs
metrics_newer_10000       0.151 K - 11868.04x slower +6643.30 μs
metrics_new_10000         0.141 K - 12695.87x slower +7106.73 μs