nathanielc opened 1 year ago
> we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down
This sounds great. There is a parameter we can tune to adjust performance: when creating the tdigest you provide it with a max size. I use 100 as a reasonable default, but we could change it as needed. A lower value will mean fewer allocations and likely better performance as a result.
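For reference, here is a minimal sketch of how that parameter is supplied, assuming the `tdigest` crate's builder-style API (`new_with_size`, `merge_unsorted`, `estimate_quantile`); the sample data is made up:

```rust
use tdigest::TDigest;

fn main() {
    // A smaller max size means fewer centroids and fewer allocations,
    // at the cost of some accuracy in the quantile estimates.
    let digest = TDigest::new_with_size(100);

    // Simulated response times in milliseconds (illustrative data only).
    let samples: Vec<f64> = (1..=10_000).map(f64::from).collect();
    let digest = digest.merge_unsorted(samples);

    println!("p50 = {}", digest.estimate_quantile(0.50));
    println!("p99 = {}", digest.estimate_quantile(0.99));
}
```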
Thanks for the quick feedback. I have updated the PR to be based off the main branch.
Additionally I updated all the tests and introduced a `Digest` type. Now the scenario, transaction, and request types are all written in terms of the `Digest` type, which hides `TDigest` as an implementation detail. This gets us one step closer to allowing a generic digest implementation if we wish.
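To illustrate the intent (not the exact code in this PR), a hypothetical wrapper could look roughly like the sketch below, assuming the `tdigest` crate's consuming, builder-style methods:

```rust
use tdigest::TDigest;

/// Hypothetical wrapper: callers only see `Digest`, and TDigest stays an
/// implementation detail that could later be swapped for another digest.
pub struct Digest {
    inner: TDigest,
}

impl Digest {
    /// Create a digest bounded to at most `max_size` centroids.
    pub fn new(max_size: usize) -> Self {
        Digest {
            inner: TDigest::new_with_size(max_size),
        }
    }

    /// Record a batch of observed values, returning the updated digest.
    /// Consuming `self` mirrors the underlying crate's merge methods.
    pub fn record(self, values: Vec<f64>) -> Self {
        Digest {
            inner: self.inner.merge_unsorted(values),
        }
    }

    /// Estimate a quantile, e.g. 0.95 for the 95th percentile.
    pub fn quantile(&self, q: f64) -> f64 {
        self.inner.estimate_quantile(q)
    }
}
```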
This change might have introduced some new dead code. I am going to do another pass to see if that is so and delete it accordingly.
By uncommenting line 236 of tests/controller.rs I was able to debug the CI build failure. When requesting `metrics-json` the result is much longer than expected, overfilling `buf` and causing the test to fail several steps later. If there's no way to usefully reduce the size of the object returned, it's probably best to simply remove `metrics-json` from the controller test.
Thanks for the hints on the failing tests. I still haven't fixed them yet, but I have learned a few things:

Removing `metrics-json` from the test bits is not enough. For example, the step in the test that tries to set the `host` and expects a failure fails because the metrics are returned in json and that breaks the message passing logic. The serialized format includes the centroid data directly. I see a few ways forward.
My vote is option 2 as it seems like the best long term solution since that crate is more active and more likely to accept the change.
Why is the format so large?
Under the hood a t-digest is a set of centroids; each centroid is two numeric values (i.e. the mean and the weight). With the max_size set to 100 the tdigest will store at most 100 centroids. However each centroid is serialized as `{"mean": 1.23456789, "weight": 1.23456789}`, and repeated 100x that is not efficient. A more efficient implementation would serialize a list of means separately from a list of weights and zip them back up at deserialization time, i.e. `{"means": [1, 2, 3, 4, 5], "weights": [6, 7, 8, 9, 0]}`. This would save a significant number of bytes in the serialized format. However, do we still expect that two lists of 100 numbers each will be small enough? If not we might need to consider other options like option 3 to allow for larger data transfers.
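As a sketch of the idea (not a change that exists in the tdigest crate today), a custom serde representation along these lines would split the centroids into two parallel lists:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical compact wire format: centroid means and weights are stored
/// as two parallel lists instead of one JSON object per centroid.
#[derive(Serialize, Deserialize)]
struct CompactDigest {
    means: Vec<f64>,
    weights: Vec<f64>,
}

impl CompactDigest {
    /// Split (mean, weight) centroid pairs into parallel vectors for serialization.
    fn from_centroids(centroids: &[(f64, f64)]) -> Self {
        CompactDigest {
            means: centroids.iter().map(|c| c.0).collect(),
            weights: centroids.iter().map(|c| c.1).collect(),
        }
    }

    /// Zip the parallel vectors back into centroid pairs after deserialization.
    fn into_centroids(self) -> Vec<(f64, f64)> {
        self.means.into_iter().zip(self.weights).collect()
    }
}
```

Back-of-the-envelope, two lists of 100 floats serialized as JSON are on the order of a few kilobytes, versus repeating the `mean`/`weight` field names 100 times in the per-centroid form.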
The trade-off with tdigest is that it is guaranteed to never use more than 100 centroids, whereas the previous implementation was technically unbounded; however, tdigest is much more likely to use all 100 centroids. That means if we can get the logic working for 100 centroids we know it will always work. Additionally, the value 100 was chosen somewhat arbitrarily; we could choose a smaller value if needed.
Thanks for working through these details with me. Happy to go in whichever direction.
The size of the tdigest serialization format shouldn't matter if we remove `MetricsJson` from the tests... The only reason I'd see for wanting to reduce the size of this object is if it's growing too large for Goose to sustain multi-day load tests. Have you tried extended load tests and seen how large the metrics grow?
To remove it from the tests, I was thinking of something like this patch:
```diff
diff --git a/tests/controller.rs b/tests/controller.rs
index a66ebc3..ddacd0d 100644
--- a/tests/controller.rs
+++ b/tests/controller.rs
@@ -560,19 +560,8 @@ async fn run_standalone_test(test_type: TestType) {
                 }
             }
             ControllerCommand::MetricsJson => {
-                match test_state.step {
-                    // Request the running metrics in json format.
-                    0 => {
-                        make_request(&mut test_state, "metrics-json\r\n").await;
-                    }
-                    // Confirm the metrics are returned in json format.
-                    _ => {
-                        assert!(response.starts_with(r#"{"hash":0,"#));
-
-                        // Move onto the next command.
-                        test_state = update_state(Some(test_state), &test_type).await;
-                    }
-                }
+                // Move onto the next command.
+                test_state = update_state(Some(test_state), &test_type).await;
             }
             ControllerCommand::Start => {
                 match test_state.step {
@@ -715,9 +704,9 @@ async fn update_state(test_state: Option<TestState>, test_type: &TestType) -> Te
             ControllerCommand::Config,
             ControllerCommand::ConfigJson,
             ControllerCommand::Metrics,
-            ControllerCommand::MetricsJson,
             ControllerCommand::Stop,
             ControllerCommand::Shutdown,
+            ControllerCommand::MetricsJson,
         ];
         if let Some(mut state) = test_state {
```
By moving `::MetricsJson` to after `::Shutdown` it is effectively disabled. (Something more/different will need to be done to address the WebSocket tests: I was looking at the Telnet tests for now.)
That said, now there's an (apparently) unrelated problem in which users aren't starting quickly enough and so we instead fail with the following error:

```
thread 'test_telnet_controller' panicked at 'assertion failed: goose_metrics.total_users == MAX_USERS', tests/controller.rs:142:5
```

At a quick glance I'm not sure why this would happen, as `--hatch-rate` seems to correctly be set to 25 and it's sleeping 1 second, which should allow all 20 users to hatch. Did you already rebase against `main` where tests are working?
Thanks for the diff, I did something similar and got past that part of the test failures.
> Did you already rebase against main where tests are working?
Yes, and I ran into the same issue with `MAX_USERS`. I traced the issue down to these lines: https://github.com/tag1consulting/goose/blob/main/tests/controller.rs#L608-L621 The assertion on line 614 fails and so the user count is never changed to `MAX_USERS`. The reason the assertion fails has something to do with the large size of the metrics in json format. When I printed out the response it was a bunch of JSON about the means and weights. I am not sure why the response to the `host` command contains JSON of the metrics, but it did, and that broke the test system.
This change replaces the existing metrics collection with a tdigest implementation. This is beneficial for a few reasons:
This means we can get more accurate quantiles without having to use more memory than we currently do.
The previous logic would round values in order to save space; tdigest does something similar, but does so in a way that minimizes the accumulated error introduced by the rounding. Instead of rounding values to predetermined values, it rounds values to the closest existing point already in the distribution, and only rounds when there is a need (i.e. it has reached its space limit).
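As a rough illustration of that idea (not the tdigest crate's actual merge algorithm, which also compresses centroids based on quantile position), absorbing a sample into the nearest existing centroid amounts to updating a weighted mean:

```rust
/// Illustrative only: a sample is folded into the centroid whose mean is
/// closest to it, rather than being rounded to a predetermined bucket.
#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

fn absorb(centroids: &mut [Centroid], value: f64) {
    // Find the centroid nearest to the incoming value.
    if let Some(nearest) = centroids.iter_mut().min_by(|a, b| {
        (a.mean - value)
            .abs()
            .partial_cmp(&(b.mean - value).abs())
            .unwrap()
    }) {
        // Update the weighted mean instead of storing the raw value,
        // so memory stays bounded by the number of centroids.
        nearest.mean = (nearest.mean * nearest.weight + value) / (nearest.weight + 1.0);
        nearest.weight += 1.0;
    }
}
```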
This PR is still a WIP, but I wanted to start the conversation around the contribution before continuing:
TODO:
This PR addresses part of issue #366.
As an aside, we are already using this branch successfully, getting good quantiles from our scenario runs.