tag1consulting / goose

Load testing framework, inspired by Locust
https://tag1.com/goose
Apache License 2.0

Feat use tdigest for metrics #546

Open nathanielc opened 1 year ago

nathanielc commented 1 year ago

This change replaces the existing metrics collection with a tdigest implementation. This is beneficial for a few reasons:

  1. It is bounded in the amount of memory it uses
  2. It is a statistically sound approach to approximating quantiles of a distribution
  3. Tdigest is serializable, and multiple tdigests can be merged into a single tdigest, making it well suited to gaggle use cases.

This means we can get more accurate quantiles without having to use more memory than we currently do.

The previous logic rounded values in order to save space. Tdigest does something similar, but in a way that minimizes the accumulated error introduced by the rounding: instead of rounding values to predetermined buckets, it rounds a value to the closest point already in the distribution, and it only rounds when there is a need (i.e. it has reached its space limit).
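The rounding behavior can be sketched roughly like this. This is a simplified illustration of the idea, not the tdigest crate's actual algorithm (real implementations also bound centroid weights near the tails to keep extreme quantiles accurate):

```rust
// Sketch: centroids are (mean, weight) pairs. Below the size limit,
// values are stored exactly; at the limit, a new value is folded into
// the nearest existing centroid by updating its weighted mean.

#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

struct MiniDigest {
    max_size: usize,
    centroids: Vec<Centroid>, // kept roughly sorted by mean
}

impl MiniDigest {
    fn new(max_size: usize) -> Self {
        Self { max_size, centroids: Vec::new() }
    }

    fn add(&mut self, value: f64) {
        if self.centroids.len() < self.max_size {
            // Below the space limit: store the value exactly.
            let idx = self.centroids.partition_point(|c| c.mean < value);
            self.centroids.insert(idx, Centroid { mean: value, weight: 1.0 });
        } else {
            // At the limit: merge into the *nearest* existing centroid,
            // so the error introduced by rounding stays small.
            let nearest = self
                .centroids
                .iter_mut()
                .min_by(|a, b| {
                    (a.mean - value)
                        .abs()
                        .partial_cmp(&(b.mean - value).abs())
                        .unwrap()
                })
                .unwrap();
            nearest.mean =
                (nearest.mean * nearest.weight + value) / (nearest.weight + 1.0);
            nearest.weight += 1.0;
        }
    }

    fn quantile(&self, q: f64) -> f64 {
        // Crude quantile estimate: walk the cumulative weight.
        let total: f64 = self.centroids.iter().map(|c| c.weight).sum();
        let target = q * total;
        let mut seen = 0.0;
        for c in &self.centroids {
            seen += c.weight;
            if seen >= target {
                return c.mean;
            }
        }
        self.centroids.last().map(|c| c.mean).unwrap_or(f64::NAN)
    }
}

fn main() {
    let mut d = MiniDigest::new(10);
    for i in 1..=1000 {
        d.add(i as f64);
    }
    // Memory stays bounded at max_size centroids regardless of input count.
    assert_eq!(d.centroids.len(), 10);
    println!("approx p50 = {}", d.quantile(0.5));
}
```

The key property is visible in `add`: the structure never grows past `max_size`, and rounding only kicks in once that limit is hit.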

This PR is still WIP, but I wanted to start the conversation around the contribution before continuing:

TODO:

This PR addresses part of issue #366

As an aside, we are already using this branch successfully and getting good quantiles from our scenario runs.

nathanielc commented 1 year ago

we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down

This sounds great. There is a parameter we can tune to adjust performance: when creating the tdigest you provide it with a max size. I use 100 as a reasonable default, but we could change it as needed. A lower value will mean fewer allocations and likely better performance as a result.

nathanielc commented 1 year ago

Thanks for the quick feedback. I have updated the PR to be based off the main branch.

Additionally, I updated all the tests and introduced a Digest type. Now the scenario, transaction and request types are all written in terms of the Digest type, which hides TDigest as an implementation detail. This gets us one step closer to allowing generic digest implementations if we wish.

This change might have introduced some new dead code. I am going to do another pass to see if that is so and delete it accordingly.

jeremyandrews commented 1 year ago

By uncommenting line 236 of tests/controller.rs I was able to debug the CI build failure. When requesting metrics-json, the result is much longer than expected and overfills buf, causing the test to fail several steps later. If there's no way to usefully reduce the size of the object returned, it's probably best to simply remove metrics-json from the controller test.
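If I understand the failure mode correctly, it can be sketched like this (a simplified simulation with a made-up buffer size and made-up messages, not Goose's actual controller code): a response longer than the read buffer leaves unread bytes in the stream, and those stale bytes then corrupt the read for the next command.

```rust
use std::collections::VecDeque;

// Hypothetical sketch: a fixed-size read buffer smaller than one
// response leaves the tail of that response in the stream, so the
// *next* command's read returns stale bytes instead of its own reply.

const BUF_SIZE: usize = 16;

fn read_response(stream: &mut VecDeque<u8>) -> String {
    let mut buf = [0u8; BUF_SIZE];
    let mut n = 0;
    while n < BUF_SIZE {
        match stream.pop_front() {
            Some(b) => {
                buf[n] = b;
                n += 1;
            }
            None => break,
        }
    }
    String::from_utf8_lossy(&buf[..n]).into_owned()
}

fn main() {
    let mut stream: VecDeque<u8> = VecDeque::new();
    // A metrics-json style response much longer than the buffer...
    stream.extend(br#"{"means":[1,2,3],"weights":[4,5,6]}"#.iter().copied());
    // ...followed by the reply to the next command.
    stream.extend(b"host set".iter().copied());

    let first = read_response(&mut stream);
    let second = read_response(&mut stream);

    println!("first:  {}", first);
    println!("second: {}", second);
    // The second read still contains leftover JSON, not "host set".
    assert!(second.contains("weights"));
}
```

That would explain why the failure only shows up several steps after the metrics-json request itself.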

nathanielc commented 1 year ago

Thanks for the hints on the failing tests. I haven't fixed them yet, but I have learned a few things:

  1. The change to tdigest has caused several failures because its serialized format is large (it overflows the buffer).
    • Removing the metrics-json test bits is not enough. For example, the step in the test that tries to set the host and expects a failure fails because the metrics are returned in JSON, and that breaks the message passing logic.
    • I expect other code paths will also break.
  2. It's not possible to change the serialization format to something more compact, because the tdigest crate does not expose the centroid data directly.

I see a few ways forward:

  1. Submit a PR to the tdigest crate to improve its serialization format
  2. Change the t-digest implementation to this other crate: https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html That crate does not currently have a serialization format, but they have an open issue for it (https://github.com/crepererum/pdatastructs.rs/issues/61), and our implementation could ensure it's compact enough.
  3. Change something in Goose to allow for larger data transfers. I haven't dug into the message passing logic much yet, but my assumption is that we have a fixed-size buffer to keep the packet size low for telnet? Maybe there is something we can do here?
  4. Something else?

My vote is option 2, as it seems like the best long-term solution: that crate is more active and more likely to accept the change.

Why is the format so large?

Under the hood a t-digest is a set of centroids, and each centroid is two numeric values (the mean and the weight). With max_size set to 100, the tdigest will store at most 100 centroids. However, each centroid is serialized as {"mean": 1.23456789, "weight": 1.23456789}, and repeating that 100x is not efficient. A more efficient implementation would serialize a list of means separately from a list of weights and zip them back up at deserialization time, i.e. {"means": [1, 2, 3, 4, 5], "weights": [6, 7, 8, 9, 0]}. This would save a significant number of bytes in the serialized format. However, do we still expect that two lists of 100 numbers each will be small enough? If not, we might need to consider other options like option 3 to allow for larger data transfers.
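The difference between the two layouts can be sketched with plain string formatting (a real implementation would use serde; the centroid values below are made up for illustration):

```rust
// Sketch of the two serialization layouts: one JSON object per
// centroid vs. a list of means alongside a list of weights.

fn verbose(centroids: &[(f64, f64)]) -> String {
    // Repeats the "mean" and "weight" key names for every centroid.
    let items: Vec<String> = centroids
        .iter()
        .map(|(m, w)| format!(r#"{{"mean":{},"weight":{}}}"#, m, w))
        .collect();
    format!("[{}]", items.join(","))
}

fn compact(centroids: &[(f64, f64)]) -> String {
    // Stores the key names once, zipping the lists back up on decode.
    let means: Vec<String> = centroids.iter().map(|(m, _)| m.to_string()).collect();
    let weights: Vec<String> = centroids.iter().map(|(_, w)| w.to_string()).collect();
    format!(
        r#"{{"means":[{}],"weights":[{}]}}"#,
        means.join(","),
        weights.join(",")
    )
}

fn main() {
    // 100 hypothetical centroids with full-precision f64 values.
    let centroids: Vec<(f64, f64)> = (0..100)
        .map(|i| (i as f64 * 1.234567891, 1.0 + i as f64 * 0.1))
        .collect();

    let v = verbose(&centroids);
    let c = compact(&centroids);
    println!("verbose: {} bytes", v.len());
    println!("compact: {} bytes", c.len());
    // The compact layout drops the repeated key names, saving roughly
    // twenty bytes per centroid in this example.
    assert!(c.len() < v.len());
}
```

Even in the compact layout, 100 full-precision f64 values per list still add up, which is why the buffer question above matters either way.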

The trade-off with tdigest is that it is guaranteed to never use more than 100 centroids, whereas the previous implementation was technically unbounded; however, tdigest is much more likely to use all 100 centroids. That means if we can get the logic working for 100 centroids, we know it will always work. Additionally, the value 100 was chosen somewhat arbitrarily; we could choose a smaller value if needed.

Thanks for working through these details with me. Happy to go in whichever direction.

jeremyandrews commented 1 year ago

The size of the tdigest serialization format shouldn't matter if we remove MetricsJson from the tests... The only reason I'd see wanting to reduce the size of this object is if it's growing too large for Goose to sustain multi-day load tests. Have you tried extended load tests and seen how large the metrics grow?

To remove from the tests, I was thinking something like this patch:

diff --git a/tests/controller.rs b/tests/controller.rs
index a66ebc3..ddacd0d 100644
--- a/tests/controller.rs
+++ b/tests/controller.rs
@@ -560,19 +560,8 @@ async fn run_standalone_test(test_type: TestType) {
                     }
                 }
                 ControllerCommand::MetricsJson => {
-                    match test_state.step {
-                        // Request the running metrics in json format.
-                        0 => {
-                            make_request(&mut test_state, "metrics-json\r\n").await;
-                        }
-                        // Confirm the metrics are returned in json format.
-                        _ => {
-                            assert!(response.starts_with(r#"{"hash":0,"#));
-
-                            // Move onto the next command.
-                            test_state = update_state(Some(test_state), &test_type).await;
-                        }
-                    }
+                    // Move onto the next command.
+                    test_state = update_state(Some(test_state), &test_type).await;
                 }
                 ControllerCommand::Start => {
                     match test_state.step {
@@ -715,9 +704,9 @@ async fn update_state(test_state: Option<TestState>, test_type: &TestType) -> Te
         ControllerCommand::Config,
         ControllerCommand::ConfigJson,
         ControllerCommand::Metrics,
-        ControllerCommand::MetricsJson,
         ControllerCommand::Stop,
         ControllerCommand::Shutdown,
+        ControllerCommand::MetricsJson,
     ];

     if let Some(mut state) = test_state {

By moving ::MetricsJson to after ::Shutdown, it is effectively disabled. (Something more/different will need to be done to address the WebSocket tests: I was looking at the Telnet tests for now.)

That said, now there's an (apparently) unrelated problem in which users aren't starting quickly enough, so we instead fail with the following error:

thread 'test_telnet_controller' panicked at 'assertion failed: goose_metrics.total_users == MAX_USERS', tests/controller.rs:142:5

At a quick glance I'm not sure why this would happen, as --hatch-rate seems to correctly be set to 25, and the test sleeps 1 second, which should allow all 20 users to hatch. Did you already rebase against main where tests are working?

nathanielc commented 1 year ago

Thanks for the diff, I did something similar and got past that part of the test failures.

Did you already rebase against main where tests are working?

Yes, and I ran into the same issue with MAX_USERS. I traced the issue down to these lines: https://github.com/tag1consulting/goose/blob/main/tests/controller.rs#L608-L621 The assertion on line 614 fails, so the user count is never changed to MAX_USERS. The reason the assertion fails has to do with the large size of the metrics in JSON format: when I printed out the response, it was a bunch of JSON about the means and weights. I am not sure why the response to the host command contains the metrics JSON, but it did, and that broke the test system.