netdata / netdata

[Bug]: `pulsar` module creates charts with huge number of dimensions #13320

Open ilyam8 opened 4 years ago

ilyam8 commented 4 years ago

This messes things up on the netdata dashboard.

We need to disable/filter things and provide only a summary by default. As test data input, we should use data from Pulsar on the Cloud testing/staging environment.

For instance, this chart

https://github.com/netdata/go.d.plugin/blob/a8b243886cacebb4df226bfbf9e2ebe15a0b1d7d/modules/pulsar/charts.go#L418-L426

has 680 dimensions 😅
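
For reference, a go.d chart declares a fixed set of dimensions. A minimal sketch of the "summary by default" direction, using the module.Chart types from go.d.plugin (the chart and dimension IDs here are illustrative, not the module's actual ones):

package pulsar

import "github.com/netdata/go.d.plugin/agent/module"

// One broker-level summary chart with two fixed dimensions,
// instead of one dimension per topic. IDs are illustrative only.
var summaryCharts = module.Charts{
	{
		ID:    "broker_messages_rate",
		Title: "Broker Messages Rate",
		Units: "messages/s",
		Fam:   "messages",
		Ctx:   "pulsar.broker_messages_rate",
		Dims: module.Dims{
			{ID: "pulsar_rate_in", Name: "publish"},
			{ID: "pulsar_rate_out", Name: "dispatch"},
		},
	},
}

Per-topic charts would then be opt-in via configuration rather than created for every topic the broker exposes.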

knatsakis commented 3 years ago

Ref netdata/product#1618

ilyam8 commented 3 years ago

@knatsakis do we use per-topic metrics? I checked a Pulsar broker instance on staging and it exposes 100k+ metrics.

[ilyam@pc ~]$ grep -v "^#" pulsar100k | wc -l
100246

There are a lot because of the number of topics.

[ilyam@pc ~]$ grep -E -o "^[^{]+" pulsar100k | sort | uniq -c | sort -nr
   2228 pulsar_subscription_unacked_messages
   2228 pulsar_subscription_msg_throughput_out
   2228 pulsar_subscription_msg_rate_redeliver
   2228 pulsar_subscription_msg_rate_out
   2228 pulsar_subscription_delayed
   2228 pulsar_subscription_blocked_on_unacked_messages
   2228 pulsar_subscription_back_log_no_delayed
   2228 pulsar_subscription_back_log
   2222 pulsar_throughput_out
   2222 pulsar_throughput_in
   2222 pulsar_subscriptions_count
   2222 pulsar_storage_size
   2222 pulsar_rate_out
   2222 pulsar_rate_in
   2222 pulsar_producers_count
   2222 pulsar_msg_backlog
   2222 pulsar_consumers_count
   2221 pulsar_storage_write_latency_sum
   2221 pulsar_storage_write_latency_overflow
   2221 pulsar_storage_write_latency_le_50
   2221 pulsar_storage_write_latency_le_5
   2221 pulsar_storage_write_latency_le_200
   2221 pulsar_storage_write_latency_le_20
   2221 pulsar_storage_write_latency_le_1000
   2221 pulsar_storage_write_latency_le_100
   2221 pulsar_storage_write_latency_le_10
   2221 pulsar_storage_write_latency_le_1
   2221 pulsar_storage_write_latency_le_0_5
   2221 pulsar_storage_write_latency_count
   2221 pulsar_storage_offloaded_size
   2221 pulsar_storage_backlog_size
   2221 pulsar_storage_backlog_quota_limit
   2221 pulsar_in_messages_total
   2221 pulsar_in_bytes_total
   2221 pulsar_entry_size_sum
   2221 pulsar_entry_size_le_overflow
   2221 pulsar_entry_size_le_512
   2221 pulsar_entry_size_le_4_kb
   2221 pulsar_entry_size_le_2_kb
   2221 pulsar_entry_size_le_1_mb
   2221 pulsar_entry_size_le_1_kb
   2221 pulsar_entry_size_le_16_kb
   2221 pulsar_entry_size_le_128
   2221 pulsar_entry_size_le_100_kb
   2221 pulsar_entry_size_count
     11 caffeine_cache_requests_total
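
In other words, the per-subscription and per-topic families above alone account for 8 × 2,228 + 9 × 2,222 + 28 × 2,221 = 100,010 of the 100,246 series: the cardinality is driven almost entirely by the topic and subscription labels.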
knatsakis commented 3 years ago

@knatsakis do we use per-topic metrics?

Yes, we do.

cakrit commented 3 years ago

Actually, I don't even see Pulsar in production now. Are we doing anything here? I believe someone was working on handling such a high number of dimensions for a single chart, but we don't even need to have them on individual charts.

ilyam8 commented 3 years ago

It's not enabled because of this issue. Let me provide more details.

The problem

Our Pulsar instances expose a huge number of time series, more than 100,000. See https://github.com/netdata/netdata/issues/13320.

The reason is a huge number of topics. I heard it is due to a missing topic retention policy, misbehaving clients that don't delete topics, or both; I don't remember the exact reason. @knatsakis can provide details on this.

For comparison, a common netdata agent instance collects 2k+ metrics, 50 times fewer than what a single one of our Pulsar instances exposes 😅

Filtering is not an option; it fixes nothing, because the SREs need all the data, not on the charts, but in the DB, so they can query the metrics they need.

❗ This works with Prometheus because it is (simplifying) a 2-step process: Prometheus stores all the metrics, and dashboards query and show only what is needed.

We, in contrast, store charts, and our dashboard queries everything and shows everything.

How to reproduce

This is possible using the go.d example collector:

[ilyam@pc netdata]$ pwd
/opt/netdata/etc/netdata

[ilyam@pc netdata]$ grep -E "^module|example" go.d.conf
modules:
  example: yes
[ilyam@pc go.d]$ pwd
/opt/netdata/etc/netdata/go.d

[ilyam@pc go.d]$ grep -v "#" example.conf
jobs:
  - name: example
    update_every: 10
    charts:
      num: 50
      dimensions: 2228
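
With this configuration the example collector creates 50 charts with 2,228 dimensions each, i.e. 50 × 2,228 = 111,400 dimensions in total, roughly the scale of a single one of our Pulsar brokers.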

Netdata problems (or possible problems)

Dashboard

(screenshot: Screenshot 2021-01-20 at 20 12 11)

It is not hard to see that this is unusable; the dashboard lags terribly when those charts are in focus.

@jacekkolasa do you have any thoughts on how to make using the dashboard less painful when there are charts with a huge number of dimensions?

Plugins.d

@mfundul @stelfrag what do you think about transferring so much data via plugins.d as plain text? Could that be a bottleneck, or is it not a problem at all?
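
For a rough sense of scale (assuming ~30 bytes for a typical plugins.d SET line, which is an estimate): 100,000 metrics per collection is about 100,000 × 30 B ≈ 3 MB of text to write and parse per iteration. At update_every: 10 that is ~300 KB/s and ~10,000 lines/s; at every-second collection it would be ten times that.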

ACLK/Cloud

@underhood

Number of metrics: 2k+ => 100k+, charts with thousands of dimensions - any possible problems?

cakrit commented 3 years ago

We will need to aggregate data based on parts of label values. See this example:

pulsar_subscription_back_log{
 app="pulsar",
 cluster="pulsar",
 component="broker",
 exported_cluster="pulsar",
 instance="10.1.14.227:8080",
 job="kubernetes-pods",
 kubernetes_namespace="pulsar",
 kubernetes_pod_name="pulsar-broker-69c9b5bfb8-rzgcc",
 namespace="public/default",
 pod_template_hash="69c9b5bfb8",
 release="pulsar",
 subscription="cloud-spaceroom-service",
 topic="persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p"}

The really troublesome part here is the topic. But we don't care about the last part of that label value; we only want persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service. The answer @knatsakis gave isn't actually accurate, see the screenshot below:

(screenshot)

The challenge here will be to do the aggregation so that we collapse hundreds or thousands of individual measurements into a single dimension. This is most likely not covered by the grouping we can define on the prometheus collector now, but I'm not sure. It's also not supported by statsd synthetic charts, from what I could tell from the docs. But it's a very useful feature that would allow us to generically store and show fewer dimensions.
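
A minimal sketch of that aggregation, assuming the collector normalizes the topic label before creating dimensions (the suffix regexp and the aggregate helper are hypothetical, inferred from the example label value above):

package main

import (
	"fmt"
	"regexp"
)

// podSuffix matches the Kubernetes replica-set hash + pod id that the
// callback topics in this thread carry, e.g. "-7787d55d69-slj9p".
// The exact pattern is an assumption based on the example label value.
var podSuffix = regexp.MustCompile(`-[0-9a-f]{8,10}-[a-z0-9]{5}$`)

// aggregate collapses per-pod topics into one value per logical topic
// by stripping the pod suffix and summing the samples that map to it.
func aggregate(samples map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for topic, v := range samples {
		out[podSuffix.ReplaceAllString(topic, "")] += v
	}
	return out
}

func main() {
	samples := map[string]float64{
		"persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p": 3,
		"persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-ab1cd": 5,
	}
	// Hundreds of per-pod series collapse into a single dimension.
	fmt.Println(aggregate(samples))
}

Generalized, this amounts to letting the collector declare a per-label rewrite rule that is applied before dimensions are created, so the per-pod series collapse into one dimension per logical topic.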

knatsakis commented 3 years ago

The reason is a huge number of topics. I heard it is due to a missing topic retention policy, misbehaving clients that don't delete topics, or both; I don't remember the exact reason. @knatsakis can provide details on this.

It's because we are accumulating callback topics. More details can be found here: https://github.com/netdata/product/issues/755 https://github.com/netdata/product/issues/1049 https://github.com/netdata/product/issues/1478

knatsakis commented 3 years ago

The really troublesome part here is the topic. But we don't care about the last part of that label value; we only want persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service. The answer @knatsakis gave isn't actually accurate, see the screenshot below:

I don't think this is the case. We don't want to aggregate the metrics; at least not us, as Netdata's SREs. Our users may want to.

In the screenshot you shared, Leonidas filters the topics shown in the dropdown with:

label_values(pulsar_rate_in{topic=~".*/default/[^-]+$"}, topic)

i.e. he keeps only topics whose final path segment contains no hyphen, which excludes the per-pod callback topics.

I think we should either:

cakrit commented 3 years ago

I just asked for a resource to work on the inbox/callback topic cleanup in https://netdata-cloud.slack.com/archives/CHH8X9M5J/p1611430028006300

So that I understand the use case though: what does the -7787d55d69-slj9p in topic="persistent://public/default/ClaimNodeCallback-cloud-spaceroom-service-7787d55d69-slj9p" tell you? Maybe it's best if we chat with @ilyam8 for a bit, to clarify.

ralphm commented 2 years ago

I am not aware of the earlier use, but the thing with -7787d55d69-slj9p at the end of topic names is not used anymore. At the current count we have about 180 different values for the topic label, and this includes the partition suffixes for partitioned topics. I.e. if you have a topic called foo with 6 partitions, you'll have a metric per partition, with topic having the values foo-0, foo-1, … foo-5.
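
For scale: with the ~45 per-topic metric families counted earlier in this thread, 180 topic label values work out to roughly 180 × 45 ≈ 8,100 per-topic series, an order of magnitude fewer than the 100k+ measured before.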

@ilyam8 do you still think this is problematic with such numbers?

cakrit commented 2 years ago

We have worked around this by changing how we name topics, but the general product limitation remains: when we have too many dimensions, we either need to generate a lot of charts, or a lot of dimensions in a single chart.