scylladb / scylla-monitoring

Simple monitoring of Scylla with Grafana
https://scylladb.github.io/scylla-monitoring/
Apache License 2.0

Enhancement: Hide the "Load" somewhere and show "CQL Load" instead #2003

Open vladzcloudius opened 1 year ago

vladzcloudius commented 1 year ago

Describe the feature and the current behavior/state: Users are constantly confused by our "Load" graph, which can reach 100% simply because compactions are running. And when users see "Load" at 100%, they start panicking. ;)

To avoid this, we should move the current "Load" graph to, say, the "Advanced" dashboard (it is indeed meant for advanced users) and instead show a "CQL Load" graph on the "Detailed" dashboard.

This graph should show the sum of the CPU utilizations of every CQL-processing scheduling group (everything that starts with sl: in Enterprise, and whatever is used for that in OSS), or of whatever is selected in the "SG" selector.
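For illustration, a minimal sketch of what such a panel could query, assuming the per-scheduling-group runtime counter scylla_scheduler_runtime_ms (milliseconds, labeled by group and shard) and the sl: naming convention; the exact label names, regex, and rate window would need to match the actual dashboard templates:

```
# Hypothetical "CQL Load" panel: per-node CPU % consumed by CQL scheduling
# groups, averaged over shards. rate() on a milliseconds counter yields
# ms/s, and dividing by 10 converts ms/s into percent of one shard.
avg by (instance) (
  sum by (instance, shard) (
    rate(scylla_scheduler_runtime_ms{group=~"sl:.*"}[1m]) / 10
  )
)
```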

Who will benefit from this feature? Customers and ScyllaDB Support personnel.

Idea suggested by @isburmistrov

vladzcloudius commented 1 year ago

cc @tomer-sandler @tarzanek @pdbossman @igorribeiroduarte @ebenzecri @gcarmin @harel-z

mykaul commented 1 year ago

@tzach ?

amnonh commented 1 year ago

@vladzcloudius I can see the value of showing the priority-class load (I'm not sure if we report it). I'm not sure about not showing the real CPU "load"; Scylla does work, even if it's a background process.

tzach commented 1 year ago

Users expect a metric of overall system 0-100 "load", where 100 means the system is at max compute capacity. Examples are Redis's loading_loaded_perc [2] and MariaDB's cpu_percent [1]. Another alternative is "resource usage" [3].

[1] https://docs.datadoghq.com/integrations/azure_db_for_mariadb/#data-collected
[2] https://docs.datadoghq.com/integrations/redisdb/?tab=host#data-collected
[3] https://www.datadoghq.com/dashboards/mongodb-dashboard/

isburmistrov commented 1 year ago

I would formulate it this way:

  1. The primary thing users care about is latency.
  2. Users expect to see a "load" metric where 100 means there is no extra room, i.e. the primary metric (latency) is about to be impacted.

Point 2 is not true for Scylla (sometimes it holds, sometimes it doesn't; you always need to dig deep to understand where the load is coming from).

That's why, on the main dashboards (Overview, Detailed), I'd show some other metric for which point 2 would hold.

tzach commented 1 year ago

Thanks @isburmistrov. I do not have a strong opinion about the value of the load metric on the main dashboard. I believe the dashboard should be stable, since users get used to the metrics, layout, and even colors of the dashboard. We can (and should) update it, but we must be careful about it.

TL;DR: looking for more input.

vladzcloudius commented 1 year ago

> @vladzcloudius I can see the value of showing the priority-class load (I'm not sure if we report it).

We do, @amnonh. You can find it on the "Advanced" dashboard.

> I'm not sure about not showing the real CPU "load"; Scylla does work, even if it's a background process.

I didn't say we should not show it at all - just not on the "Overview" and "Detailed" dashboards that most users rely on.

I said we should move it to the "Advanced" dashboard. Please read the opening message again; it's all there.

vladzcloudius commented 1 year ago

> Thanks @isburmistrov. I do not have a strong opinion about the value of the load metric on the main dashboard. I believe the dashboard should be stable, since users get used to the metrics, layout, and even colors of the dashboard. We can (and should) update it, but we must be careful about it.

> TL;DR: looking for more input.

It's somewhat strange that you say this, @tzach, given that we keep changing the "Overview" and "Detailed" dashboards too often even for my taste.

However, if you insist, for whatever reason, on keeping the current "Load" graph on the "Detailed" dashboard, let's at least add a new graph showing the "CQL Load" as proposed.

This request didn't come out of boredom - it came because I got tired of explaining to our users that "Load at 100% is OK! Because..." It's a fact: Scylla's Load metric is too confusing for the vast majority of people. It's time we added a graph that makes more sense to them.

isburmistrov commented 1 year ago

> I believe the dashboard should be stable, since users get used to the metrics, layout, and even colors of the dashboard. We can (and should) update it, but we must be careful about it.

Hmmmm. Just recently, a very confusing change was introduced in the new version: only sl:default latencies are shown by default. Another recent example: in one of the latest changes, nodes suddenly started being identified by names rather than IPs, as before.

These are just two examples of big changes where this principle wasn't respected. Why were those changes deemed OK, and how is the proposed one different in this regard, @tzach?

tzach commented 1 year ago

> These are just two examples of big changes where this principle wasn't respected. Why were those changes deemed OK, and how is the proposed one different in this regard, @tzach?

I think we all agree we should be careful about UX-breaking changes moving forward.

I suggest creating a PR to test the usability of the proposed change.

michoecho commented 1 year ago

> This graph should show the sum of the CPU utilizations of every CQL-processing scheduling group (everything that starts with sl: in Enterprise, and whatever is used for that in OSS), or of whatever is selected in the "SG" selector.

Would that be useful, though? For example, in a test where I populate a cluster (at full utilization) with cassandra-stress, the distribution of work is very roughly:

So the point is: in this workload, the cluster would be saturated with "CQL load" at 40% (or even less). In a workload consisting almost only of reads with very high cache hit ratio, the cluster would be saturated with "CQL load" at ~100%. So it seems to me that the value of this graph wouldn't be very helpful in predicting how close the cluster is to saturation.

I agree that a "true load" metric would be desirable, but I don't know how we could implement it reliably.

vladzcloudius commented 1 year ago

> I agree that a "true load" metric would be desirable, but I don't know how we could implement it reliably.

Good point (and examples), @michoecho. But this is actually not too hard. What we should do is consider the current value of the shares of each scheduling group plus the current CPU usage.

The actual saturation point is when the CQL classes can't use any more CPU, given the other classes' CPU utilization and shares.

Then we can do any of the following (not limited to):

* We can define a "CQL processing utilization" graph showing current(dynamic)_saturation_utilization/current_cql_utilization. Somebody may think this graph would be confusing, since it may drop while the CQL workload stays the same simply because compactions kicked in. But this is EXACTLY what our customers want to see, because this IS the truth! ;) And the truth is that it would drop - but not to zero, as our customers tend to think when they see Load = 100%.

* We can show worst_case_scenario_saturation_utilization/current_cql_utilization (the worst case here is when all possible scheduling groups work together and utilize the maximum amount of CPU). The problem with this graph is that it would always paint a worse-than-real picture, because it is almost never the case that all scheduling groups work at the same time (e.g. streaming).

WDYT?
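For the second (worst-case) option, a rough sketch of how such a ratio could be queried, assuming a scylla_scheduler_shares gauge alongside scylla_scheduler_runtime_ms and the sl: convention (all illustrative; note that summing shares across shards cancels out in the ratio):

```
# Current CQL CPU % (per node, averaged over shards), divided by the
# worst-case saturation point: the CPU fraction guaranteed to the sl:*
# groups if every scheduling group were runnable at once.
avg by (instance) (
  sum by (instance, shard) (
    rate(scylla_scheduler_runtime_ms{group=~"sl:.*"}[1m]) / 10
  )
)
/
(
  100
  * sum by (instance) (scylla_scheduler_shares{group=~"sl:.*"})
  / sum by (instance) (scylla_scheduler_shares)
)
```

A value near 1 would mean the CQL groups are running at their worst-case guaranteed share of the CPU.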

michoecho commented 1 year ago

> > I agree that a "true load" metric would be desirable, but I don't know how we could implement it reliably.
>
> Good point (and examples), @michoecho. But this is actually not too hard. What we should do is consider the current value of the shares of each scheduling group plus the current CPU usage.
>
> The actual saturation point is when the CQL classes can't use any more CPU, given the other classes' CPU utilization and shares.
>
> Then we can do any of the following (not limited to):
>
> * We can define a "CQL processing utilization" graph showing current(dynamic)_saturation_utilization/current_cql_utilization. Somebody may think this graph would be confusing, since it may drop while the CQL workload stays the same simply because compactions kicked in. But this is EXACTLY what our customers want to see, because this IS the truth! ;) And the truth is that it would drop - but not to zero, as our customers tend to think when they see Load = 100%.
>
> * We can show worst_case_scenario_saturation_utilization/current_cql_utilization (the worst case here is when all possible scheduling groups work together and utilize the maximum amount of CPU). The problem with this graph is that it would always paint a worse-than-real picture, because it is almost never the case that all scheduling groups work at the same time (e.g. streaming).
>
> WDYT?

I'm not sure what formula for current_saturation_utilization you have in mind, but I think predicting saturation utilization based on shares is a good idea in general.

But we should remember that shares change dynamically, so if we calculate the saturation point from the shares in a cluster that's not fully loaded, the prediction might be overly optimistic (because raising the CQL load might require raising shares in maintenance groups to keep up). And there is only so much room for error before the metric starts giving a false sense of security. I'm not sure how big the error would be in practice.

amnonh commented 1 year ago

I'm still looking for a good solution.

What about: [image]

Or: [image]

Let's split the discussion about the names (user, compaction) from the discussion about the general approach. So fight the urge and let's talk about the approach of splitting the utilization into two numbers/gauges.

I pick compaction from the priority groups because compaction is something we can delay during rush hour. It can also be

As we remember from https://github.com/scylladb/seastar/issues/1699, not all CPU utilization is accounted for, so whatever is not accounted for becomes part of "user" in this case. I can defend that approach, though it's not totally accurate.

An alternative is to have the utilization of a user-specific group (whichever matches the filter) as the second column.
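As a sketch of the two-gauge split (one query per gauge; assuming scylla_reactor_utilization and scylla_scheduler_runtime_ms, and using "compaction" as the group name purely for illustration):

```
# Gauge 1 - "compaction": CPU % spent in the compaction scheduling group.
avg by (instance) (
  sum by (instance, shard) (
    rate(scylla_scheduler_runtime_ms{group="compaction"}[1m]) / 10
  )
)

# Gauge 2 - "user": total reactor utilization minus the compaction part.
# Per seastar#1699, time not attributed to any scheduling group lands here.
avg by (instance) (scylla_reactor_utilization)
  - avg by (instance) (
      sum by (instance, shard) (
        rate(scylla_scheduler_runtime_ms{group="compaction"}[1m]) / 10
      )
    )
```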

amnonh commented 1 year ago

@vladzcloudius @michoecho do you have a suggested solution for this issue? If there's one, it can still make it into the 4.5 release.