go-graphite / carbon-clickhouse

Graphite metrics receiver with ClickHouse as storage

How best to configure #144

Closed gavD closed 6 months ago

gavD commented 6 months ago

Hello, we (Telemetry team at HMRC) are using carbon-clickhouse 0.11.6 on our digital tax platform and we have some questions.

Our situation

99.9% of the time, the chunks of metrics data entering our platform are ingested by carbon-clickhouse in <1s and everything works brilliantly.

However, we have observed that every 6 months or so we get a spate of data chunks that, for some reason, take carbon-clickhouse or ClickHouse itself >60s to ingest. The result was that ClickHouse ingest ground to a halt on 5 of our 6 nodes.

Having inspected these files, I cannot find anything odd about them; it does not appear to be simply a function of their size, and there's nothing obviously odd in the metrics themselves.

Our mitigation

Initially, we moved the stuck .lz4 files out of the way; this unstuck the ingest.

We then increased our upload timeout to 3 minutes. (99.9% of the time our data comes nowhere near this; it ingests in <1s.)

This increase allowed us to ingest the affected .lz4 files.

Our configuration

Here is what I believe to be the relevant config; I've removed sensitive info:

max-cpu = 2

[data]
# Folder for buffering received data
path = "/data/carbon-clickhouse/"
# Rotate (and upload) file interval.
# Keep chunk-interval small to minimize the lag between point receipt and storage.
chunk-interval = "1s"
# Auto-increase the chunk interval if the number of unprocessed files grows.
# Example: set the chunk interval to 10s if the unhandled file count >= 5, and to 60s if it reaches >= 20:
chunk-auto-interval = "5:10s,20:60s"
# Compression algorithm to use when storing temporary files.
# Might be useful to reduce space usage when Clickhouse is unavailable for an extended period of time.
# Currently supported: none, lz4
compression = "lz4"
# Compression level to use.
# For "lz4" 0 means use normal LZ4, >=1 use LZ4HC with this depth (the higher - the better compression, but slower)
compression-level = 0

[upload.graphite]
type = "points"
table = "graphite.graphite"
threads = 2
url = "http://localhost:8123/"
timeout = "3m0s"

[upload.graphite_index]
type = "index"
table = "graphite.graphite_index"
threads = 2
url = "http://localhost:8123/"
timeout = "3m0s"
cache-ttl = "12h0m0s"
disable-daily-index = false

[udp]
enabled = false

[tcp]
listen = "localhost:2103"
enabled = true
drop-future = "0s"
drop-past = "0s"

[pickle]
enabled = false

[grpc]
enabled = false

[prometheus]
enabled = false

Our questions

  1. Is it worth upgrading to 0.11.7? Would that help with this intermittent issue?
  2. Are there any risks in us having increased the timeout from 1m to 3m?
  3. What is the effect of chunk-auto-interval being set to "5:10s,20:60s"? My understanding is that the more chunks we have, the more carbon-clickhouse will slow the ingest cadence down to give itself time to ingest the larger chunks. However, if 60s is the cap, should that be raised to match our timeout? As much detail as you could give here would be very helpful to my team and me.
  4. Can you suggest anything that may be causing these rare incidents?

We really really appreciate any guidance you can offer us!

Felixoid commented 6 months ago

Please analyze the time frame of the stuck files. It's highly likely that they're very wide, so ClickHouse struggles to process them.

This is pretty much the first direction I'd look.
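A quick way to check the time span covered by a stuck file might look like the sketch below. It assumes the buffer files are lz4-compressed and decompress to tab-separated rows with the point timestamp in the third column (as gavD reports later in the thread); the filename is hypothetical.

# Print the time span (in seconds) covered by one buffered file.
# Assumption: tab-separated rows of Path, Value, Time, Date, Timestamp.
lz4 -dc /data/carbon-clickhouse/graphite/stuck-example.lz4 \
  | awk -F'\t' 'NR == 1 { min = max = $3 }
                { if ($3 < min) min = $3; if ($3 > max) max = $3 }
                END { printf "span: %d s (%d .. %d)\n", max - min, min, max }'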

gavD commented 6 months ago

thank you @Felixoid , that is helpful

Felixoid commented 6 months ago

Regarding the other questions

  1. There are quite a few changes between the versions in the uploader module, which could affect this. However, I suspect ClickHouse is the bottleneck. If the inserted file covers a wide time range, it will land in multiple partitions, and ClickHouse has to spend a lot of memory splitting the data between them.
  2. As a consequence of the previous point, a memory spike should occur at insertion time.
  3. The chunk-interval and the timeout from the upload.graphite config table are orthogonal. The first one says how often to ingest the data; the second is when to give up. chunk-auto-interval controls the first one depending on the number of unprocessed files. Take into account that carbon-clickhouse aggregates existing files up to chunk-max-size, so it makes sense to set the latter to some adequate boundary; I used to use something around 50M AFAIK (see the sketch below).
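For illustration, a [data] section with a bounded chunk size might look like this. It assumes chunk-max-size sits in the [data] table and takes a byte count; check the sample config shipped with your version for the exact name and form.

[data]
path = "/data/carbon-clickhouse/"
# Rotate every second under normal load...
chunk-interval = "1s"
# ...but back off to 10s once 5 files are waiting, and to 60s at 20 files
chunk-auto-interval = "5:10s,20:60s"
# Cap how much buffered data is merged into one upload (~50M, per the advice above);
# the exact value format here is an assumption
chunk-max-size = 52428800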

You should also read the ClickHouse logs from the moment of the incident to identify why it struggles to process the data.
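If the query log is enabled on the ClickHouse side, a query along these lines (system.query_log and its columns are standard ClickHouse; the 10s threshold is arbitrary) can surface slow inserts, their memory usage, and any exception text:

clickhouse-client -q "
  SELECT event_time, query_duration_ms, memory_usage, written_rows, exception
  FROM system.query_log
  WHERE type != 'QueryStart'
    AND query ILIKE '%INSERT INTO graphite.graphite%'
    AND query_duration_ms > 10000
  ORDER BY event_time DESC
  LIMIT 20"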

gavD commented 6 months ago

Please analyze the time frame of the stuck files. It's highly likely that they're very wide, so ClickHouse struggles to process them.

Thank you so much for the advice.

We analysed 3 files; 2 that were reported as involved in the fault, and 1 that was not. We looked at the earliest and latest timestamp columns.

We observed that the "good" data covered 2min42s. The "faulty" files were 2min13s and 3min47s.

It's hard to reconstruct after the fact so it's possible I've misunderstood which data was and was not problematic, but would a span of 2 to 4 minutes indicate an issue?

We are extremely grateful for your help and will continue to investigate using your advice, thank you so much

Felixoid commented 6 months ago

We observed that the "good" data covered 2min42s. The "faulty" files were 2min13s and 3min47s.

No, that's not a big enough difference to be the reason for the fault. I expected something like months of data. What about the size of the files?

The next thing I'd suggest is manually ingesting the data into the database.

On the other hand, maybe the issue is not in carbon-clickhouse but in the database itself. Is it possible that the table was in read-only mode at the time? For example, if ZooKeeper/Keeper was unavailable for a replicated table.
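For a replicated setup, the read-only question can be checked directly via the standard system.replicas table; a sketch of such a check:

clickhouse-client -q "
  SELECT database, table, is_readonly, zookeeper_exception
  FROM system.replicas
  WHERE is_readonly"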

gavD commented 6 months ago

The smallest file we have that fails to ingest is 3.5 MB as .lz4, 28 MB uncompressed.

We copied problematic .lz4 files from an affected host into a totally separate cluster, and they get "stuck" with the error:

[2024-05-02T07:57:28.062Z] ERROR [upload] handle failed {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714557990323410639.lz4", "metrics": 136028, "error": "io: read/write on closed pipe", "time": 60.005398179}

I see no spike in either CPU or memory usage on ClickHouse. We can import the problematic data from .lz4, but only by setting the carbon-clickhouse timeout above 60s.

The next thing I'd suggest is manually ingesting the data into the database.

Thanks :-) I can decompress the .lz4 into what looks like a .tsv. Importing our smallest problematic file with clickhouse-client -q "INSERT INTO graphite.graphite FORMAT TabSeparated" < TEL-4503-problematic-small.tsv successfully imports 173747 rows in under 1 second. This is the correct row count ✅

If I disable compression in carbon-clickhouse and drop a .tsv of Graphite data into /data/carbon-clickhouse/default.1714637159014089574, then I get:

[2024-05-02T08:14:41.272Z] INFO [upload] start handle {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714637159014089574"}
[2024-05-02T08:14:41.273Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/data/carbon-clickhouse/graphite_index/default.1714637159014089574"}
[2024-05-02T08:14:41.282Z] INFO [upload] handle success {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714637159014089574", "metrics": 0, "time": 0.01039542}
[2024-05-02T08:14:41.292Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/data/carbon-clickhouse/graphite_index/default.1714637159014089574", "metrics": 0, "time": 0.019001497}

So, assuming I've got it configured correctly, it processes this file of Graphite metrics but reports 0 metrics; I think I'm doing something wrong here. However, we did have this problem back in October, before we enabled compression, so we don't think compression is the cause.

An example datum from our import is below (with some redactions), and it looks valid:

play.<redacted>.ecs-<redacted>-p800-<redacted>.uk.gov.hmrc.play.bootstrap.metrics.MetricsFilter.200.m15_rate    0   1713513603  2024-04-19  1713513604
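For reference, those fields appear to follow the usual column order of the graphite points table in this stack; the mapping below is an assumption based on the stock schema rather than something confirmed above.

# Path<TAB>Value<TAB>Time<TAB>Date<TAB>Timestamp
# (metric name, value, point timestamp, partition date, write timestamp)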

Upgrading to carbon-clickhouse 0.11.7 did not fix the issue with this data.

Thank you again so much for the help :-)

Felixoid commented 6 months ago

Sorry, Gavin, I'm a bit out of resources to continue here. Maybe you could get more support in the https://t.me/ru_go_graphite Telegram chat. Ignore that it's in Russian; it's rather international =)

gavD commented 6 months ago

thanks, I appreciate your help :-)