Open zadean opened 9 months ago
I'm going to wait to merge this until after we finish getting a release working on OTP 26. We've just merged that and will release soon.
Hello there! Saw this go by and wanted to offer a few considerations:
sum_of_squares: this "slot" was used for other information, not the actual sum-of-squares. That said, the sum_of_squares field is not filled in for any of the metrics that the Elixir agent writes, so that might not matter. But it's probably the case that this field is basically considered deprecated, or at least "unused".
counter module: the changes here introduce two more reads and a little extra math for every single update of every metric tracked. It won't slow down the user's app, since the processing is done in a cast to a GenServer, but it's more work overall.
Thank you @binaryseed. I will check in with them.
Thanks for the feedback @binaryseed! I'll try to quantify the performance hit in the meanwhile. I'll also try adding the count and sum (maybe mean) to the acc. That might drop the performance hit a bit, calculating the mean only once instead of on each pass through.
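For what it's worth, carrying the count and sum in the accumulator might look roughly like this (a hypothetical sketch, not the agent's actual code; the map keys are illustrative). Mathematically it is still Welford's update for the sum-of-squares term:

```elixir
# Hypothetical accumulator sketch: carry the running count and sum so the
# mean is derived from the acc instead of being recomputed per counter read.
update = fn %{count: c, sum: s, m2: m2} = acc, value ->
  old_mean = if c == 0, do: 0.0, else: s / c
  c = c + 1
  s = s + value
  new_mean = s / c
  # Welford step: m2 accumulates the sum of squared deviations
  %{acc | count: c, sum: s, m2: m2 + (value - old_mean) * (value - new_mean)}
end

acc0 = %{count: 0, sum: 0.0, m2: 0.0}
final = Enum.reduce([1.0, 2.0, 3.0], acc0, fn v, acc -> update.(acc, v) end)
# final.m2 is the sum of squared deviations from the mean
```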
As for whether sum_of_squares is used, I think it is. I deployed this in a test environment and was able to see the value passed through, and was also able to use stddev in queries without an empty value being returned.
A little context on the "why" of having this: testing a canary deploy in prod against the current running state. If the reply time of the top N endpoints is more than 1/X standard deviations above the existing code's average, the deployment is stopped. Hence the use/need of the NRQL stddev function.
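That canary gate reduces to a simple comparison. This is an illustrative sketch only (the function name and threshold factor are made up, not the actual deploy tooling):

```elixir
# Illustrative canary check: allow the deploy only while the canary's mean
# response time stays within x standard deviations of the baseline mean.
canary_ok? = fn canary_mean, baseline_mean, baseline_stddev, x ->
  canary_mean <= baseline_mean + x * baseline_stddev
end
```

With sum_of_squares always reported as 0, stddev comes back empty, which is why the same check can behave differently across agents.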
Also, I checked some Java apps' metrics, and they seem to have this value set to something, and the same deployment checks that fail for Elixir pass for Java. 😄 I guess setting the value to 1 instead of 0 would also do that... but yeah. Trying to do it right. 🤣
I finally found some time to benchmark this change. main is the current implementation, new is the change in this PR, and newer is an additional change to do a single match-out of the metric instead of using "dot" access, like this:
%{
name: name,
scope: scope,
call_count: call_count,
total_call_time: total_call_time,
total_exclusive_time: total_exclusive_time,
min_call_time: min_call_time,
max_call_time: max_call_time
} = metric
I did this run with 1, 10, 100, 1,000 and 10,000 random metric values.
It seems this change does add some latency, roughly 4:7, but this code is super fast! I had to pass fast_warning: false to Benchee to stop the warnings! 😄
If the latency seems to be within reason, I'll add the match-out to this change to get the faster version.
Name ips average deviation median 99th %
metrics_newer_1 1786.32 K 0.56 μs ±113.92% 0.41 μs 4.15 μs
metrics_new_1 1593.68 K 0.63 μs ±93.03% 0.47 μs 4.18 μs
metrics_main_1 1265.06 K 0.79 μs ±631.89% 0.60 μs 2.30 μs
metrics_main_10 217.22 K 4.60 μs ±28.33% 4.40 μs 9.70 μs
metrics_newer_10 142.85 K 7.00 μs ±253.19% 7 μs 10 μs
metrics_new_10 135.34 K 7.39 μs ±235.98% 7 μs 11 μs
metrics_main_100 23.22 K 43.06 μs ±103.05% 42 μs 49 μs
metrics_newer_100 15.00 K 66.68 μs ±10.12% 66 μs 92 μs
metrics_new_100 14.25 K 70.20 μs ±14.90% 69 μs 89 μs
metrics_main_1000 2.36 K 423.68 μs ±4.36% 420 μs 492 μs
metrics_newer_1000 1.48 K 677.77 μs ±34.73% 662 μs 785.38 μs
metrics_new_1000 1.40 K 714.48 μs ±6.85% 698 μs 959.52 μs
metrics_main_10000 0.24 K 4239.77 μs ±17.98% 4190 μs 4551.00 μs
metrics_newer_10000 0.151 K 6643.86 μs ±1.62% 6621 μs 6978.68 μs
metrics_new_10000 0.141 K 7107.29 μs ±16.08% 6977 μs 9256.44 μs
Comparison:
metrics_newer_1 1786.32 K
metrics_new_1 1593.68 K - 1.12x slower +0.0677 μs
metrics_main_1 1265.06 K - 1.41x slower +0.23 μs
metrics_main_10 217.22 K - 8.22x slower +4.04 μs
metrics_newer_10 142.85 K - 12.51x slower +6.44 μs
metrics_new_10 135.34 K - 13.20x slower +6.83 μs
metrics_main_100 23.22 K - 76.92x slower +42.50 μs
metrics_newer_100 15.00 K - 119.11x slower +66.12 μs
metrics_new_100 14.25 K - 125.39x slower +69.64 μs
metrics_main_1000 2.36 K - 756.82x slower +423.12 μs
metrics_newer_1000 1.48 K - 1210.71x slower +677.21 μs
metrics_new_1000 1.40 K - 1276.29x slower +713.92 μs
metrics_main_10000 0.24 K - 7573.58x slower +4239.21 μs
metrics_newer_10000 0.151 K - 11868.04x slower +6643.30 μs
metrics_new_10000 0.141 K - 12695.87x slower +7106.73 μs
Calculate "Total Sum of Squares" value for Metrics.
This uses Welford's online algorithm. It is relatively accurate when using 64-bit floating point values, losing accuracy only to float rounding. Here a bit more is lost, since the values are limited in size and held as integers. (Still better than 0, imho 😄)
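For reference, the textbook form of Welford's update looks like this (a minimal sketch; the module and function names are illustrative, not the agent's actual implementation):

```elixir
defmodule Welford do
  # State: {count, mean, m2}, where m2 is the running sum of squared
  # deviations from the mean (the "total sum of squares" being added here).
  def init, do: {0, 0.0, 0.0}

  def update({count, mean, m2}, value) do
    count = count + 1
    # delta uses the previous mean; (value - mean) below uses the new one
    delta = value - mean
    mean = mean + delta / count
    {count, mean, m2 + delta * (value - mean)}
  end

  def sum_of_squares({_count, _mean, m2}), do: m2
end
```

For [1.0, 2.0, 3.0] this yields m2 = 2.0, i.e. (1-2)² + (2-2)² + (3-2)².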
This also removes the sum_of_squares field from the NewRelic.Metric struct, since it isn't needed for the calculation and doesn't need an extra Access call. Also, counters default to 0, so the initial add of 0 to 0 was removed.
Adding this value to the reported metrics allows the stddev function to be used on them. This can be helpful when looking at performance regressions, for example.
Note: this could add a bit of overhead in the extremely hot function it's in! Understandably, that could have negative effects on the overall performance of the agent and on reporting metrics on time.
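To sketch why reporting the sum of squared deviations is enough for a standard deviation: given the call count and that sum, the stddev is just a square root and a division. This is illustrative only, and doesn't claim how NRQL's stddev is computed internally (population vs. sample):

```elixir
# Illustrative only: recovering a standard deviation from the aggregated
# call count and total sum of squared deviations (m2).
population_stddev = fn count, m2 -> :math.sqrt(m2 / count) end
sample_stddev = fn count, m2 -> :math.sqrt(m2 / (count - 1)) end
```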
I also noticed that most of the agents don't have this value set, or have it set only to 0 or 1 or so, so I imagine that's due to the calculation overhead.
My feelings won't be hurt if this doesn't end up in the default branch 💟, but wanted to put this out there for consideration and discussion.