evdevk opened this issue 1 year ago
I suggest enabling https://clickhouse.com/docs/en/operations/system-tables/query_log and then investigating what causes the mentioned issue. It will be quite easy after that.
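For example, once `query_log` is enabled, a query along these lines should surface the heavy midnight inserts (a sketch using standard `system.query_log` columns; `query_kind` exists in recent ClickHouse versions, on older ones filter with `query LIKE 'INSERT%'` instead, and adjust the time window to your case):

```sql
-- Heaviest recent INSERT queries, by bytes written
SELECT
    query_duration_ms,
    written_bytes,
    http_user_agent,
    query_start_time,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY written_bytes DESC
LIMIT 20
```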
cache-ttl, AFAIR, is an internal setting used to avoid re-sending already-uploaded metric names to the tags and index tables.
I rather suspect that some of your users are requesting a huge number of metrics from Graphite, rather than this being somehow related to carbon.
I tried stopping graphite-clickhouse (while keeping carbon-clickhouse running) at 23:50 and still got the same problem around 00:00-01:00. I have no cron jobs on ClickHouse. The Graphite rollup covers a longer period than the data exists (rollup is 3 days, data TTL is 2 days), so no retention aggregation should be active. So the next experiment will be stopping carbon-clickhouse.
What does the index table actually exist for in this software? Can I disable it if I'm not using Grafana in my setup?
At the start of a new day, carbon-clickhouse generates inserts for the daily index. They are not really needed and this might be refactored in the future. An expired cache TTL also produces new inserts. So disabling the daily index only saves disk space, it does not reduce the insert rate. These inserts also produce high CPU usage until the parts are merged again (which is costly on huge index/tags tables).
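If your carbon-clickhouse build includes the `disable-daily-index` option, the index upload section can be configured roughly like this (a sketch; option names and defaults may differ between versions, check your build's README):

```toml
[upload.graphite_index]
type = "index"
table = "graphite.graphite_index"
url = "http://localhost:8123/"
# how long uploaded metric names are cached before being re-sent
# to the index table; expiry produces a fresh burst of inserts
cache-ttl = "12h0m0s"
# skip writing the per-day index rows; saves disk space but,
# as noted above, does not reduce the daily insert spike
disable-daily-index = true
```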
What is the size/record count of your index/tags tables, and how many unique metrics go into the daily index/tags per day?
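Those numbers can be pulled with something like the following (table names taken from the INSERT statements above; adjust database/table to your schema):

```sql
-- Size on disk and row count of the index table
SELECT
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'graphite' AND table = 'graphite_index' AND active;

-- Unique metric paths written for today
SELECT uniq(Path)
FROM graphite.graphite_index
WHERE Date = today();
```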
Also, ClickHouse tries to use all available cores for background processes (like merges). Restricting them to a smaller value (than 54) may be a solution.
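The merge/mutation thread pool is capped with the `background_pool_size` server setting (a config fragment sketch; the exact knob set varies by ClickHouse version, newer releases also expose `background_merges_mutations_concurrency_ratio`):

```xml
<!-- e.g. /etc/clickhouse-server/config.d/background_pools.xml (path is an example) -->
<clickhouse>
    <!-- limit merge/mutation worker threads instead of letting them use all 54 cores -->
    <background_pool_size>16</background_pool_size>
</clickhouse>
```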
I stopped carbon-clickhouse 5 minutes before 00:00 and there were zero load spikes. So this is definitely not graphite-clickhouse. I will try checking ClickHouse background processes and running carbon-clickhouse with the daily index disabled. Thanks @msaf1980
At 00:00 it generates heavy inserts, around 4-5x larger than normal:
```
┌─query_duration_ms─┬─written_bytes─┬─http_user_agent────┬────query_start_time─┬─query─────────────────────────────────────────────────────────────────────────────┐
│             58927 │    1412361385 │ Go-http-client/1.1 │ 2024-04-30 00:04:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             59455 │    1328363077 │ Go-http-client/1.1 │ 2024-04-30 00:05:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             59805 │    1497485418 │ Go-http-client/1.1 │ 2024-04-30 00:05:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             63204 │    1412220635 │ Go-http-client/1.1 │ 2024-04-30 00:05:00 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             60733 │    1242829805 │ Go-http-client/1.1 │ 2024-04-30 00:05:06 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             59462 │    1496872108 │ Go-http-client/1.1 │ 2024-04-30 00:05:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             58941 │    1412365747 │ Go-http-client/1.1 │ 2024-04-30 00:06:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             60223 │    1497073175 │ Go-http-client/1.1 │ 2024-04-30 00:06:02 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│             60893 │    1328369735 │ Go-http-client/1.1 │ 2024-04-30 00:06:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
```
I've also been observing this recently. We see a large spike in inserts which can last around an hour:
We hit the "too many parts" cap in ClickHouse, and carbon-clickhouse queues metrics until it recovers:
```
[2024-06-17T14:47:16.153Z] ERROR [upload] handle failed {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1718634714840113874.lz4", "metrics": 1027490, "error": "clickhouse response status 500: Code: 252. DB::Exception: Too many parts (305). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS) (version 22.8.15.25.altinitystable (altinity build))\n", "time": 1.125482143}
```
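For reference, the active part count that trips this limit (the default `parts_to_throw_insert` is 300 parts in a single partition) can be watched with a query like:

```sql
-- Active parts per partition for the index table
SELECT partition, count() AS parts
FROM system.parts
WHERE database = 'graphite' AND table = 'graphite_index' AND active
GROUP BY partition
ORDER BY parts DESC;
```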
I found out why the LA (load average) spikes: it is caused by rollup aggregation rules being applied to too many metrics.
I have a replicated cluster (3 nodes: 54 CPUs, 256 GB RAM each). All nodes are replicas. Inserts mostly go to one node; selects go to the other two. Every night around 00:00 something generates a huge load (LA) on one of the select nodes. I think this is indexing from carbon-clickhouse, but I'm not sure. Can you provide some info about how cache-ttl and the index work? There is no info about this in the README. How does it work, and what is it for?
Also, can you give me some tips to tune my config:
Regarding this: https://github.com/go-graphite/carbon-clickhouse/pull/91