ilyam8 opened 4 years ago
Ref netdata/product#1618
@knatsakis do we use per topic metrics? I checked a pulsar broker instance on staging and it exposes 100k+ metrics.
[ilyam@pc ~]$ grep -v "^#" pulsar100k | wc -l
100246
There are a lot because of the number of topics.
[ilyam@pc ~]$ grep -E -o "^[^{]+" pulsar100k | sort | uniq -c | sort -nr
2228 pulsar_subscription_unacked_messages
2228 pulsar_subscription_msg_throughput_out
2228 pulsar_subscription_msg_rate_redeliver
2228 pulsar_subscription_msg_rate_out
2228 pulsar_subscription_delayed
2228 pulsar_subscription_blocked_on_unacked_messages
2228 pulsar_subscription_back_log_no_delayed
2228 pulsar_subscription_back_log
2222 pulsar_throughput_out
2222 pulsar_throughput_in
2222 pulsar_subscriptions_count
2222 pulsar_storage_size
2222 pulsar_rate_out
2222 pulsar_rate_in
2222 pulsar_producers_count
2222 pulsar_msg_backlog
2222 pulsar_consumers_count
2221 pulsar_storage_write_latency_sum
2221 pulsar_storage_write_latency_overflow
2221 pulsar_storage_write_latency_le_50
2221 pulsar_storage_write_latency_le_5
2221 pulsar_storage_write_latency_le_200
2221 pulsar_storage_write_latency_le_20
2221 pulsar_storage_write_latency_le_1000
2221 pulsar_storage_write_latency_le_100
2221 pulsar_storage_write_latency_le_10
2221 pulsar_storage_write_latency_le_1
2221 pulsar_storage_write_latency_le_0_5
2221 pulsar_storage_write_latency_count
2221 pulsar_storage_offloaded_size
2221 pulsar_storage_backlog_size
2221 pulsar_storage_backlog_quota_limit
2221 pulsar_in_messages_total
2221 pulsar_in_bytes_total
2221 pulsar_entry_size_sum
2221 pulsar_entry_size_le_overflow
2221 pulsar_entry_size_le_512
2221 pulsar_entry_size_le_4_kb
2221 pulsar_entry_size_le_2_kb
2221 pulsar_entry_size_le_1_mb
2221 pulsar_entry_size_le_1_kb
2221 pulsar_entry_size_le_16_kb
2221 pulsar_entry_size_le_128
2221 pulsar_entry_size_le_100_kb
2221 pulsar_entry_size_count
11 caffeine_cache_requests_total
> @knatsakis do we use per topic metrics?
Yes, we do.
Actually I don't even see Pulsar now in production. Are we doing anything here? I believe someone was working on handling such a high number of dimensions for a single chart, but we don't even need to have them on individual charts.
It's not enabled because of the issue. Let me provide more details.
Our Pulsar instances expose a huge amount of time series - more than 100,000. See https://github.com/netdata/netdata/issues/13320.
The reason is a huge number of topics. I heard it is because of a not-configured topic retention policy, misbehaving clients (that don't delete topics), or both - I don't remember the exact reason. @knatsakis can provide details on this.
For comparison - a common Netdata Agent instance collects 2k+ metrics. That is 50 times less than what just one of our Pulsar instances exposes 😅
Filtering is not an option - it fixes nothing, because the SRE guys need all the data: not on the charts, but in the DB, so they can query the metrics they need.
❗ It works with Prometheus because it is (simplifying) a 2-step process: Prometheus scrapes and stores everything, and dashboards then query only the metrics they need. We, on the other hand, store charts, and our dashboard queries everything and shows everything.
Reproducing this is possible using the go.d example collector. Enable the example module in go.d.conf:
```
[ilyam@pc netdata]$ pwd
/opt/netdata/etc/netdata
[ilyam@pc netdata]$ grep -E "^module|example" go.d.conf
modules:
  example: yes
```
go.d/example.conf
```
[ilyam@pc go.d]$ pwd
/opt/netdata/etc/netdata/go.d
[ilyam@pc go.d]$ grep -v "#" example.conf
jobs:
  - name: example
    update_every: 10
    charts:
      num: 50
      dimensions: 2228
```
Restart netdata.service, wait a bit, and check the Example Charts section on the dashboard. Not hard to assume that it is unusable - the dashboard lags like hell when those charts are in focus.
@jacekkolasa do you have any thoughts on how to make using the dashboard less painful when there are charts with a huge number of dimensions?
@mfundul @stelfrag what do you think about transferring so much data via plugins.d using plain text? Could that be a bottleneck, or not a problem at all?
@underhood Number of metrics: 2k+ => 100k+, charts with thousands of dimensions - any possible problems?
We will need to aggregate data based on parts of label values. See this example:
pulsar_subscription_back_log{
app="pulsar",
cluster="pulsar",
component="broker",
exported_cluster="pulsar",
instance="10.1.14.227:8080",
job="kubernetes-pods",
kubernetes_namespace="pulsar",
kubernetes_pod_name="pulsar-broker-69c9b5bfb8-rzgcc",
namespace="public/default",
pod_template_hash="69c9b5bfb8",
release="pulsar",
subscription="cloud-spaceroom-service",
topic="persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p"}
The really troublesome part here is the topic label. But we don't care about the last part of that label value, we only want persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service. The answer @knatsakis gave isn't accurate actually, see below:
The challenge here will be to do the aggregation, so that we collapse hundreds or thousands of individual measurements into a single dimension. This is most likely not covered by the grouping we can define on the prometheus collector now, but I'm not sure. It's also not supported by statsd synthetic charts from what I could tell from the docs. But it's a very useful feature that will allow us to generically store and show fewer dimensions.
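To make the idea concrete, here is a minimal, hypothetical Go sketch of that kind of aggregation: strip the per-pod suffix from the topic label value and sum every series that collapses onto the same base topic. The sample type, the regular expression, and the function name are all made up for illustration - this is not existing collector code.

```go
package main

import (
	"fmt"
	"regexp"
)

// sample is a hypothetical scraped value together with its topic label.
type sample struct {
	topic string
	value float64
}

// Assumed pattern: topics end with "-<replicaset-hash>-<pod-suffix>",
// e.g. "...ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p".
var podSuffix = regexp.MustCompile(`-[0-9a-f]{8,10}-[0-9a-z]{5}$`)

// collapseByBaseTopic sums samples whose topics differ only by the pod suffix.
func collapseByBaseTopic(samples []sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		base := podSuffix.ReplaceAllString(s.topic, "")
		out[base] += s.value
	}
	return out
}

func main() {
	samples := []sample{
		{"persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p", 3},
		{"persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-abcde", 5},
	}
	// Both series collapse into one dimension:
	// persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service => 8
	for topic, v := range collapseByBaseTopic(samples) {
		fmt.Println(topic, v)
	}
}
```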
> The reason is a huge number of topics. I heard it is because of a not-configured topic retention policy, misbehaving clients (that don't delete topics), or both - I don't remember the exact reason. @knatsakis can provide details on this.
It's because we are accumulating callback topics. More details can be found here: https://github.com/netdata/product/issues/755 https://github.com/netdata/product/issues/1049 https://github.com/netdata/product/issues/1478
> The really troublesome part here is the topic label. But we don't care about the last part of that label value, we only want persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service. The answer @knatsakis gave isn't accurate actually, see below:
I don't think this is the case. We don't want to aggregate the metrics. At least not us, as Netdata's SREs. Our users may want to.
In the screenshot you shared, Leonidas filters the topics to be shown in the dropdown with:
label_values(pulsar_rate_in{topic=~".*/default/[^-]+$"}, topic)
I think, we should either:
I just asked for a resource to work on the inbox/callback topic cleanup in https://netdata-cloud.slack.com/archives/CHH8X9M5J/p1611430028006300
So that I understand the use case though, what does the -7787d55d69-slj9p in topic="persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p" tell you? Maybe it's best if we chat with @ilyam8 for a bit, to clarify.
I am not aware of the earlier use, but the thing with -7787d55d69-slj9p at the end of topic names is not used anymore. At the current count we have about 180 different values for the topic label. And this includes the partition suffixes for partitioned topics. I.e. if you have a topic called foo with 6 partitions, you'll have a metric per partition, with topic having values foo-0, foo-1, … foo-5.
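For the partition case specifically, collapsing foo-0 … foo-5 back into a single foo dimension is just a matter of stripping the numeric suffix. A tiny illustrative sketch (not existing collector code; the regex is an assumption, and it would also collapse a non-partitioned topic whose name happens to end in -<number>):

```go
package main

import (
	"fmt"
	"regexp"
)

// partitionSuffix matches the "-<N>" that Pulsar appends per partition.
var partitionSuffix = regexp.MustCompile(`-\d+$`)

// baseTopic maps "foo-0" ... "foo-5" back to "foo".
func baseTopic(topic string) string {
	return partitionSuffix.ReplaceAllString(topic, "")
}

func main() {
	for _, t := range []string{"foo-0", "foo-1", "foo-5"} {
		fmt.Println(t, "=>", baseTopic(t)) // every line prints "... => foo"
	}
}
```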
@ilyam8 do you still think this is problematic with such numbers?
We have done a workaround for this by changing how we name topics, but the general product limitation remains. When we have too many dimensions, we either need to generate a lot of charts or a lot of dimensions in a single chart, which messes things up on the Netdata dashboard.
We need to disable/filter stuff and provide only a summary by default. As test data input we need to use data from Pulsar on the Cloud testing/staging.
For instance, this chart https://github.com/netdata/go.d.plugin/blob/a8b243886cacebb4df226bfbf9e2ebe15a0b1d7d/modules/pulsar/charts.go#L418-L426 has 680 dimensions 😅
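As a rough illustration of what "only a summary by default" could look like on the collector side, here is a hypothetical go.d chart definition with a single chart and a couple of aggregated dimensions instead of one dimension per topic. The IDs, titles, and contexts are invented; only the module.Charts / module.Dims types come from go.d.plugin, and the collector would have to fill the aggregated values itself by summing the per-topic series it scrapes.

```go
package pulsar

import "github.com/netdata/go.d.plugin/agent/module"

// summaryCharts sketches a "summary only" alternative to per-topic charts:
// a single chart whose two dimensions hold totals across all topics.
// All IDs, titles, and contexts below are hypothetical.
var summaryCharts = module.Charts{
	{
		ID:    "messages_rate_summary",
		Title: "Messages Rate Across All Topics",
		Units: "messages/s",
		Fam:   "summary",
		Ctx:   "pulsar.messages_rate_summary",
		Dims: module.Dims{
			{ID: "pulsar_rate_in_total", Name: "publish"},
			{ID: "pulsar_rate_out_total", Name: "dispatch"},
		},
	},
}
```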