gavD closed this issue 6 months ago
Please analyze the time frame of the stuck files. It's highly likely that they're very wide, so ClickHouse struggles to process them.
This is pretty much the first direction I'd look.
thank you @Felixoid, that is helpful
Regarding the other questions: chunk-interval and timeout from the upload.graphite config table are orthogonal. The first one says how often to ingest the data. The second is when to give up. And chunk-auto-interval controls the first one depending on the number of unprocessed files.
Take into account that carbon-clickhouse aggregates existing files up to chunk-max-size, so it makes sense to set the latter to some adequate boundary; I used to use something around 50M, AFAIK. You should also read the ClickHouse logs at the given moment to identify why it struggles to process the data.
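For concreteness, a sketch of where those knobs sit in a carbon-clickhouse TOML config. The option names are the ones discussed above; the section layout, URL, and values are illustrative assumptions rather than a recommendation, so check them against your own config:

```toml
[data]
path = "/data/carbon-clickhouse/"
# How often a new chunk file is cut and handed to the uploaders.
chunk-interval = "1s"
# Slow the cadence as unprocessed files pile up:
# at >=5 pending files use a 10s interval, at >=20 use 60s.
chunk-auto-interval = "5:10s,20:60s"
# Pending files are aggregated into uploads of at most this size;
# the suggestion above is to bound it at roughly 50M (value illustrative).
chunk-max-size = 52428800

[upload.graphite]
type = "points"
table = "graphite.graphite"
url = "http://localhost:8123/"
# How long a single upload may run before carbon-clickhouse gives up on it.
timeout = "1m0s"
```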
Please analyze the time frame of the stuck files. It's highly likely that they're very wide, so ClickHouse struggles to process them.
Thank you so much for the advice.
We analysed 3 files: 2 that were reported as involved in the fault, and 1 that was not. We looked at the earliest and latest timestamp columns.
We observed that the "good" data covered 2min42s. The "faulty" files were 2min13s and 3min47s.
It's hard to reconstruct after the fact so it's possible I've misunderstood which data was and was not problematic, but would a span of 2 to 4 minutes indicate an issue?
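For reference, a rough sketch of that span check on a decompressed chunk, assuming the rows are tab-separated with the epoch timestamp in the third column (as in the example datum quoted later in the thread); the filename is one of the stuck chunks from the logs below:

```sh
# Report the earliest/latest timestamp in a chunk and the span in seconds.
lz4 -dc default.1714557990323410639.lz4 \
  | awk -F'\t' '
      NR == 1 { min = $3 + 0; max = $3 + 0 }
      { t = $3 + 0; if (t < min) min = t; if (t > max) max = t }
      END { printf "min=%d max=%d span=%ds\n", min, max, max - min }'
```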
We are extremely grateful for your help and will continue to investigate using your advice, thank you so much
We observed that the "good" data covered 2min42s. The "faulty" files were 2min13s and 3min47s.
No, spans like that are not wide enough to be the reason for the fault. I expected something like months of data. What about the size of the files?
The next thing I'd suggest is manually ingesting the data into the database.
On the other hand, maybe the issue is not in carbon-clickhouse but in the database itself. Is it possible that the table was in read-only mode at the time? For example, ZooKeeper/Keeper was unavailable for a replicated table.
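One quick way to check that suspicion against ClickHouse's standard system tables (a sketch; run it on each node):

```sql
-- List any replicated tables that are currently in read-only mode.
SELECT database, table, is_readonly
FROM system.replicas
WHERE is_readonly;
```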
The smallest file we have that fails to ingest is 3.5 MB as .lz4, 28 MB uncompressed.
We copied problematic .lz4 files from an affected host into a totally separate cluster, and they get "stuck" with the error:
[2024-05-02T07:57:28.062Z] ERROR [upload] handle failed {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714557990323410639.lz4", "metrics": 136028, "error": "io: read/write on closed pipe", "time": 60.005398179}
I see no spike in either CPU or memory usage on ClickHouse. We can import the problematic data in .lz4, but only by setting the carbon-clickhouse timeout above 60s.
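That timeout change is a one-line config tweak; a sketch, assuming the timeout sits in the [upload.graphite] section as discussed earlier:

```toml
[upload.graphite]
# Raised above the 60s the uploads were hitting, so slow chunks can finish.
timeout = "3m0s"
```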
The next thing I'd suggest is manually ingesting the data into the database.
Thanks :-) I can open the .lz4 to output what looks like a .tsv. Importing our smallest problematic file with clickhouse-client -q "INSERT INTO graphite.graphite FORMAT TabSeparated" < TEL-4503-problematic-small.tsv successfully imports 173747 rows in under 1 second. This is the correct row count ✅
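Spelled out end to end, that manual ingest looks roughly like this (a sketch; the chunk filename is illustrative, and lz4 plus clickhouse-client are assumed to be on the path):

```sh
# Decompress a stuck chunk to TSV, load it straight into the points table,
# then compare the row count with what was inserted.
lz4 -dc default.1714557990323410639.lz4 > TEL-4503-problematic-small.tsv
clickhouse-client -q "INSERT INTO graphite.graphite FORMAT TabSeparated" < TEL-4503-problematic-small.tsv
wc -l TEL-4503-problematic-small.tsv
```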
If I disable compression in carbon-clickhouse and drop a .tsv of Graphite data into /data/carbon-clickhouse/default.1714637159014089574, then I get:
[2024-05-02T08:14:41.272Z] INFO [upload] start handle {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714637159014089574"}
[2024-05-02T08:14:41.273Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/data/carbon-clickhouse/graphite_index/default.1714637159014089574"}
[2024-05-02T08:14:41.282Z] INFO [upload] handle success {"name": "graphite", "filename": "/data/carbon-clickhouse/graphite/default.1714637159014089574", "metrics": 0, "time": 0.01039542}
[2024-05-02T08:14:41.292Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/data/carbon-clickhouse/graphite_index/default.1714637159014089574", "metrics": 0, "time": 0.019001497}
So, assuming I've got it configured correctly, it's importing this file of Graphite metrics, but with 0 metrics. I think I'm doing something wrong here. However, we did have this problem back in October, before we enabled compression, so we don't think compression is the problem.
An example datum from our import is below (with some redactions), and it looks valid:
play.<redacted>.ecs-<redacted>-p800-<redacted>.uk.gov.hmrc.play.bootstrap.metrics.MetricsFilter.200.m15_rate 0 1713513603 2024-04-19 1713513604
Upgrading to carbon-clickhouse 0.11.7 did not fix the issue with this data.
Thank you again so much for the help :-)
Sorry, Gavin, I'm a bit out of resources to continue here. Maybe you could get more support in the https://t.me/ru_go_graphite Telegram chat. Ignore that it's in Russian; it's rather international =)
thanks, I appreciate your help :-)
Hello, we (Telemetry team at HMRC) are using carbon-clickhouse 0.11.6 on our digital tax platform and we have some questions.
Our situation
99.9% of the time, the chunks of metrics data entering our platform are ingested by carbon-clickhouse in <1s and everything works brilliantly.
However, we have observed that every 6 months we get a spate of chunks of data that, for some reason, carbon-clickhouse or ClickHouse itself takes >60s to ingest. The result was that our ClickHouse ingest ground to a halt on 5 of our 6 nodes.
Having inspected these files, I cannot find anything odd about them; it is not a mere function of their size and there's nothing obviously odd in the metrics.
Our mitigation
Initially, we moved the stuck .lz4 files out of the way; this unstuck the ingest.
We then increased our timeout to 3 minutes (our data comes nowhere near this 99.9% of the time; it's typically ingested in <1s).
This increase allowed us to ingest the affected .lz4 files.
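A sketch of that first step, with the upload directory taken from the log lines elsewhere in the thread and an arbitrary quarantine location:

```sh
# Move a stuck chunk aside so the uploader can carry on with newer files.
mkdir -p /data/carbon-clickhouse/quarantine
mv /data/carbon-clickhouse/graphite/default.1714557990323410639.lz4 \
   /data/carbon-clickhouse/quarantine/
```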
Our configuration
Here is what I believe to be the relevant config; I've removed sensitive info.
Our questions
What are the implications of chunk-auto-interval being set to "5:10s,20:60s"? My understanding is that the more unprocessed chunks we have, the more carbon-clickhouse will slow the ingest cadence down to give it time to ingest the larger chunks. However, if 60s is the cap, should that be raised to match our timeout? As much detail as you could give here would be so helpful to my team and me.
We really, really appreciate any guidance you can offer us!