scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0
8.39k stars 1.55k forks source link

Enable `smp_message_queue metrics` by default #2507

Open ballard26 opened 1 month ago

ballard26 commented 1 month ago

These metrics were originally disabled for overloading the collectd servers many years back. However, these days they seem like a drop in the bucket relative to number of metrics seastar generates. Beyond this they would be useful to have when debugging issues related to cross-shard traffic.

tchaikov commented 1 month ago

@ballard26 hi Brandon, i think your PR actually fixes #2375 which could be controversial. because, @amnonh thinks that

Those metrics are quadratic in their nature and were disabled

IIUC, by "quadratic", it means we have n x n metrics if the seastar is configured with n cores. guess that's why they were disabled. but i think, despite that they are always disabled, these metrics are still added to the metric subsystem, even they are not added to the metrics map and are not exposed.

does it make sense to add a member to smp_options to at least allow user to enable them using smp::configure() ? so that if one suspects that the system is suffering from too many cross shard traffic/calls, they can enable this and restart the seastar application, to verify the hypothesis as long as the performance issue is reproducible.

@xemul @amnonh what do you think?

amnonh commented 1 month ago

I vote against such a configuration. We already have enough options to control metrics from API and files

tchaikov commented 1 month ago

I vote against such a configuration. We already have enough options to control metrics from API and files

@amnonh hi Amnon, do you mean that we are able to utilize these options to enable a given disabled metric ?

amnonh commented 1 month ago

@tchaikov yes, you can use the file or the API to to enable them, for example, I've run two nodes (docker) cluster, I've set all metrics using the following config:

$cat relabel/conf.yaml 
relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: (.*)

Using the browser: http://172.17.0.3:9180/metrics?__name__=smp*

# HELP scylla_smp_complete_batch_queue_length Current complete batch queue length
# TYPE scylla_smp_complete_batch_queue_length gauge
scylla_smp_complete_batch_queue_length{shard="0-1"} 0.000000
scylla_smp_complete_batch_queue_length{shard="1-0"} 0.000000
# HELP scylla_smp_receive_batch_queue_length Current receive batch queue length
# TYPE scylla_smp_receive_batch_queue_length gauge
scylla_smp_receive_batch_queue_length{shard="0-1"} 0.000000
scylla_smp_receive_batch_queue_length{shard="1-0"} 0.000000
# HELP scylla_smp_send_batch_queue_length Current send batch queue length
# TYPE scylla_smp_send_batch_queue_length gauge
scylla_smp_send_batch_queue_length{shard="0-1"} 1.000000
scylla_smp_send_batch_queue_length{shard="1-0"} 1.000000
# HELP scylla_smp_send_queue_length Current send queue length
# TYPE scylla_smp_send_queue_length gauge
scylla_smp_send_queue_length{shard="0-1"} 0.000000
scylla_smp_send_queue_length{shard="1-0"} 0.000000
# HELP scylla_smp_total_completed_messages Total number of messages completed
# TYPE scylla_smp_total_completed_messages counter
scylla_smp_total_completed_messages{shard="0-1"} 3883
scylla_smp_total_completed_messages{shard="1-0"} 6889
# HELP scylla_smp_total_received_messages Total number of received messages
# TYPE scylla_smp_total_received_messages counter
scylla_smp_total_received_messages{shard="0-1"} 3883
scylla_smp_total_received_messages{shard="1-0"} 6889
# HELP scylla_smp_total_sent_messages Total number of sent messages
# TYPE scylla_smp_total_sent_messages counter
scylla_smp_total_sent_messages{shard="0-1"} 3883
scylla_smp_total_sent_messages{shard="1-0"} 6889

If you have access to the API, you can do the same in runtime using the http://localhost:10000/v2/metrics-config/ GET and POST APIs

ballard26 commented 4 weeks ago

IIUC, by "quadratic", it means we have n x n metrics if the seastar is configured with n cores. guess that's why they were disabled. but i think, despite that they are always disabled, these metrics are still added to the metric subsystem, even they are not added to the metrics map and are not exposed.

Yeah, knowing this now its understandable why these metrics are not enabled by default.

Most of the time we're only ever interested in knowing the total ingress and egress of cross shard calls by shard rather than by queue. This would be a linear rather than quadratic metric relative to the number of shards and therefore something that may be more palatable to have enabled by default.

Is there any interest from you folks for having metrics like that enabled by default in seastar? If not, I believe we could achieve a similar result in our application by using set_relabel_configs(...) to split the shard="a-b" label in the existing metrics to shard="a", dst_shard="b" then set_metric_family_configs(...) to aggregate the dst_shard label.

tchaikov commented 4 weeks ago

@tchaikov yes, you can use the file or the API to to enable them, for example, I've run two nodes (docker) cluster, I've set all metrics using the following config: ...

If you have access to the API, you can do the same in runtime using the http://localhost:10000/v2/metrics-config/ GET and POST APIs

thank you very much Amnon, didn't know that we could enable metrics at runtime.