netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.

Metric correlations are often slow #5

Closed manos-saratsis closed 3 years ago

manos-saratsis commented 3 years ago

Describe the bug
When running metric correlations, it sometimes takes a very long time to get the results; at other times the requests time out.

To Reproduce
Steps to reproduce the behavior: go to a single-node view -> click "Metric Correlations" -> select an area of interest in a chart -> click "Find Correlations". The issue is not always reproducible.

Expected behavior
Metric correlations should show results in less than 15 seconds.

Cc @dim08

papazach commented 3 years ago

Took some time to check out the logic and take some measurements of various parts of the code. The tests took place in the Staging environment, more specifically in the composite-charts space and the 100NotesTest room.

In a nutshell, three sequential operations currently take place, as depicted in the Gantt chart below (every "day" in the chart equals 100 ms):

(Gantt chart: cis_gantt)

First, we fetch the highlight node metrics and then the baseline node metrics. These two operations are practically the same function call with different time parameters. Each one first fetches the node's chart IDs (if they are not provided in the request body) and then, concurrently for each chart ID, performs a timeseries data request.
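To make that flow concrete, here is a minimal Go sketch of the fetch pattern (the function and helper names are illustrative, not the actual netdata-cloud code): fetch the chart IDs once, then issue the per-chart /data requests concurrently, bounded by a small semaphore.

```go
package correlations

import (
	"context"
	"sync"
)

// fetchChartIDs and fetchChartData stand in for the real agent calls
// (GET /api/v1/charts and GET /api/v1/data); they are placeholders here.
func fetchChartIDs(ctx context.Context, node string) ([]string, error) { return nil, nil }
func fetchChartData(ctx context.Context, node, chartID string, after, before int64) ([]float64, error) {
	return nil, nil
}

// fetchNodeMetrics fetches the chart IDs (unless supplied by the caller) and
// then runs one /data request per chart ID, with at most `concurrency`
// requests in flight at any time.
func fetchNodeMetrics(ctx context.Context, node string, after, before int64, concurrency int) (map[string][]float64, error) {
	ids, err := fetchChartIDs(ctx, node)
	if err != nil {
		return nil, err
	}

	sem := make(chan struct{}, concurrency) // bounds in-flight /data requests
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string][]float64, len(ids))
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			data, err := fetchChartData(ctx, node, id, after, before)
			if err != nil {
				return // a real implementation would collect errors
			}
			mu.Lock()
			out[id] = data
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return out, nil
}
```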

Low-hanging fruit here:

Then we send the highlight and baseline metrics over to the ML service via HTTP. This HTTP call takes approximately 1.5 s to return. Response decoding times here are negligible.

Some additional notes:

andrewm4894 commented 3 years ago

Great, @papazach, thanks for looking at this. I 100% think we should do the metric fetches at the same time; that should help a lot.

We had also talked about batching each fetch request into a subset of charts per request, to see if breaking the fetches into smaller groups could also help. That might just need a bit of coordination: either we wait and send all the data to the ML service at once, or we send each batch to the ML service as soon as we have its baseline and highlight, and add some ranking logic in the browser to handle potentially multiple responses from the ML service.

In this case the ML service just needs a chart's data for the baseline and highlight and then gives a score per dimension on the chart. So the ML service does not need to see all metrics in a single request, and we could play around with this too.
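As a rough sketch of the "fetch both at the same time" idea, assuming a helper like the fetchNodeMetrics sketched earlier (again illustrative names, not the actual service code), the highlight and baseline windows could simply be requested in parallel:

```go
package correlations

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fetchBoth requests the highlight and baseline windows concurrently instead
// of one after the other; fetchNodeMetrics is the illustrative helper from
// the earlier sketch, and 4 is today's default per-agent concurrency cap.
func fetchBoth(ctx context.Context, node string, hlAfter, hlBefore, blAfter, blBefore int64) (highlight, baseline map[string][]float64, err error) {
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		var e error
		highlight, e = fetchNodeMetrics(gctx, node, hlAfter, hlBefore, 4)
		return e
	})
	g.Go(func() error {
		var e error
		baseline, e = fetchNodeMetrics(gctx, node, blAfter, blBefore, 4)
		return e
	})
	err = g.Wait()
	return highlight, baseline, err
}
```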

papazach commented 3 years ago

> In this case the ML service just needs a chart's data for the baseline and highlight and then gives a score per dimension on the chart. So the ML service does not need to see all metrics in a single request, and we could play around with this too.

Nice that is also an interesting approach to keep in mind 👍

I surmise that the TODO item also yields another low-to-medium-hanging fruit. An agent is not expected to take 2.5-3 s to respond to a data request; that is at least 10 times more than we would expect. After addressing those initial fixes we will also reason about this and make some enhancements.

I think we can start addressing the low-hanging fruit and then re-evaluate based on the new response times (which we expect to be 4 to 5 s in total without addressing the TODO).

cakrit commented 3 years ago

Note that all these requests go to the same agent, which can explain why it takes a few seconds to answer all of them. If we parallelize more, the agent will be loaded even more. We may still get better overall performance, but the agent is clearly the bottleneck here (that includes the time it takes for the responses to go through the ACLK, not just the time it takes to generate them). If the bound is the ACLK throughput, then we won't get much more out of it. If it's the preparation of the responses, we may get an improvement, though most likely not 2x.

stelfrag commented 3 years ago

In production, I noticed that the incoming queries cap at 25 requests per second. I will try in staging.

stelfrag commented 3 years ago

> In production, I noticed that the incoming queries cap at 25 requests per second. I will try in staging.

The same in staging. The cloud does not send more than 25 requests per second.

andrewm4894 commented 3 years ago

Just adding this here to decide on as part of any work done on metric correlations (and @papazach as an FYI, since we may decide this is worth adding the next time we make any changes to the metric correlations service).

We are looking into adding a max_points-type param to the service that would use the points param from https://registry.my-netdata.io/swagger/#/default/get_data to cap the maximum number of points used to get the data.

This would mean that we could remove the limit on the size of the window on the front end and instead use max_points to aggregate the data from the agent whenever the window is bigger than max_points.
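As a hedged sketch of how such a max_points param could map onto the agent's existing points query param on /api/v1/data (the function name, the one-point-per-second assumption, and the exact semantics are illustrative only, not an agreed design):

```go
package correlations

import (
	"net/url"
	"strconv"
)

// dataURL builds an /api/v1/data request for one chart. If the selected window
// would yield more points than maxPoints (assuming one raw point per second of
// window, purely for illustration), it asks the agent to aggregate by passing
// the `points` query parameter instead of fetching full resolution.
func dataURL(agentHost, chartID string, after, before, maxPoints int64) string {
	points := before - after // rough estimate of raw points in the window
	if maxPoints > 0 && points > maxPoints {
		points = maxPoints // agent aggregates the window down to max_points
	}
	q := url.Values{}
	q.Set("chart", chartID)
	q.Set("after", strconv.FormatInt(after, 10))
	q.Set("before", strconv.FormatInt(before, 10))
	q.Set("points", strconv.FormatInt(points, 10))
	return agentHost + "/api/v1/data?" + q.Encode()
}
```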

Below is the limit I am saying we could remove if we used some sort of max_points param:

(screenshot of the front-end window-size limit)

stelfrag commented 3 years ago

@papazach Can you check if there is a limit on the maximum in-flight data requests to the agent (for the highlight and the baseline metrics)?

papazach commented 3 years ago

> @papazach Can you check if there is a limit on the maximum in-flight data requests to the agent (for the highlight and the baseline metrics)?

Hello @stelfrag, there is a cap on the cloud side on the number of concurrent requests to the same agent, and it currently defaults to 4. This applies to both the highlight and the baseline metrics data requests.

So the actual requests per second seem to depend on the agent's response times, but I think I can easily measure this given the start time of the first request, the time the last one finished, and the number of chart IDs (which equals the number of data requests to the agent). Would it be useful to get this rate?
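For reference, that rate falls out of just those three numbers; a trivial helper (illustrative only, not existing code) would look like this:

```go
package correlations

import "time"

// effectiveRate derives the achieved request rate from the start of the first
// /data request, the completion of the last one, and the number of chart IDs
// (which equals the number of /data requests issued).
func effectiveRate(firstStart, lastDone time.Time, numChartIDs int) float64 {
	elapsed := lastDone.Sub(firstStart).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(numChartIDs) / elapsed // requests per second
}
```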

stelfrag commented 3 years ago

> @papazach Can you check if there is a limit on the maximum in-flight data requests to the agent (for the highlight and the baseline metrics)?
>
> Hello @stelfrag, there is a cap on the cloud side on the number of concurrent requests to the same agent, and it currently defaults to 4. This applies to both the highlight and the baseline metrics data requests.
>
> So the actual requests per second seem to depend on the agent's response times, but I think I can easily measure this given the start time of the first request, the time the last one finished, and the number of chart IDs (which equals the number of data requests to the agent). Would it be useful to get this rate?

You could measure it, but I think it will end up being what I see on the agent, which would be about 25 requests per second. If it is easy to change, try making it 8 and let's see how much difference that makes.

papazach commented 3 years ago

> Just adding this here to decide on as part of any work done on metric correlations (and @papazach as an FYI, since we may decide this is worth adding the next time we make any changes to the metric correlations service).
>
> We are looking into adding a max_points-type param to the service that would use the points param from https://registry.my-netdata.io/swagger/#/default/get_data to cap the maximum number of points used to get the data.
>
> This would mean that we could remove the limit on the size of the window on the front end and instead use max_points to aggregate the data from the agent whenever the window is bigger than max_points.
>
> Below is the limit I am saying we could remove if we used some sort of max_points param:
>
> (screenshot of the front-end window-size limit)

Hey @andrewm4894 thanks for sharing this thought. It is quite interesting and I think it can easily be done 👍

I'll create tickets for all the improvements we talked about in this thread, so we can start addressing them as the Kubernetes-related workload hopefully gets lighter.

papazach commented 3 years ago

Just FYI I created those two tickets:

https://github.com/netdata/product/issues/1763 - fetching baseline & highlights concurrently
https://github.com/netdata/product/issues/1764 - removing the time window limit and handling big windows accordingly

papazach commented 3 years ago

OK, I just managed to get some measurements with various concurrency configs. I randomly picked a node (ip-172-31-0-233) from the composite-charts space, 100nodes room, and made 10 metric correlation requests with each concurrency config (8, 16, 22).

Each call returned a pair of timings, one for the baseline and another for the highlight. Both of these operations include a set of concurrent requests towards the agent, as discussed above: first a /charts call, then, for each chart ID received, a /data call.

With the default concurrency (4), as seen above, each operation took about 3 s in one run.

Concurrency 8:

1.35 1.07, 1.93  0.98, 2.49  1.08, 1.03 0.99, 1.18 0.98, 1.11 0.99, 1.04 0.99, 1.01 0.94, 1.09 1.30, 1.33 1.21

mean8 = 1.2 sec.

Concurrency 16:

0.71 0.72, 0.72 0.64, 0.94 0.71, 0.73 0.73, 1.59 0.9, 1.32 0.95, 0.76 0.79, 0.93 1, 1.14 1.22, 1.11 1.06

mean16 = 0.93 sec.

Concurrency 22:

1.09 1.03, 1.05 1.91, 1.42 1.11, 1.32 0.9, 0.99 0.76, 0.93 0.74, 0.81  0.86, 0.78 0.85, 0.75 1.03, 0.82 0.95

mean22 = 1.05 sec.
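Each mean above is just the arithmetic mean over the 20 timings (10 baseline/highlight pairs) of a run; a throwaway snippet like the one below reproduces the first two values (timings copied from the lists above):

```go
package main

import "fmt"

// mean returns the arithmetic mean of the given timings.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	c8 := []float64{1.35, 1.07, 1.93, 0.98, 2.49, 1.08, 1.03, 0.99, 1.18, 0.98,
		1.11, 0.99, 1.04, 0.99, 1.01, 0.94, 1.09, 1.30, 1.33, 1.21}
	c16 := []float64{0.71, 0.72, 0.72, 0.64, 0.94, 0.71, 0.73, 0.73, 1.59, 0.90,
		1.32, 0.95, 0.76, 0.79, 0.93, 1.00, 1.14, 1.22, 1.11, 1.06}
	fmt.Printf("mean8  = %.2f\n", mean(c8))  // 1.20
	fmt.Printf("mean16 = %.2f\n", mean(c16)) // 0.93
}
```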

The numbers above are not part of a full-fledged statistical analysis; the approach was simplistic. Nevertheless, I think it would be a very quick win to make a PR and increase the concurrency to 16.

And of course we should also proceed with the other tickets created; they continue to make sense.

andrewm4894 commented 3 years ago

> The numbers above are not part of a full-fledged statistical analysis; the approach was simplistic. Nevertheless, I think it would be a very quick win to make a PR and increase the concurrency to 16.

Sounds good to me. But don't take my word for it as this is outside my ballpark :)

I'm hopeful that this concurrency change and the points change could make this a lot more useful. It feels like they should.

papazach commented 3 years ago

Some initial improvements have already been made (increased ADC concurrency and concurrent data fetching for the baseline and highlight metrics), and some additional ones are in the works:

https://github.com/netdata/product/issues/1847
https://github.com/netdata/product/issues/1848

Response times have already dropped significantly, and we expect some additional gains from the above changes.