go-graphite / carbon-clickhouse

Graphite metrics receiver with ClickHouse as storage
MIT License

Index and LA #130

Open evdevk opened 1 year ago

evdevk commented 1 year ago

I have a replicated cluster (3 nodes: 54 CPUs, 256 GB RAM each). All nodes are replicas. Inserts mostly go to one node; selects go to the other two. Every night at around 00:00 something generates a huge load (LA) on one of the select nodes. I think this is the indexing from carbon-clickhouse, but I'm not sure. Can you provide some info about how cache-ttl and the index work? There is no info about this in the README. How does it work, and what is it for?

Also, can you give me some tips to tune my config:

[data]
path = "/var/spool/carbon-tagged/"
chunk-interval = "10s"
chunk-auto-interval = ""
compression = "lz4"
compression-level = 0

[upload.graphite]
type = "points"
table = "data.data"
threads = 5
url = "http://localhost:8124/"
timeout = "1m0s"
zero-timestamp = true
compress-data = true

[upload.tags]
type = "tagged"
table = "data.tags"
threads = 6
url = "http://localhost:8124/"
timeout = "2m0s"
cache-ttl = "48h0m0s"
compress-data = true
disable-daily-index = true

[upload.graphite_index]
type = "index"
table = "data.graph_index"
threads = 3
url = "http://localhost:8124/"
timeout = "1m0s"
cache-ttl = "48h0m0s"
compress-data = true
disable-daily-index = true

Related to this: https://github.com/go-graphite/carbon-clickhouse/pull/91

Felixoid commented 1 year ago

I suggest enabling https://clickhouse.com/docs/en/operations/system-tables/query_log and then investigating what causes the mentioned issue. It will be quite easy after that.
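As a rough illustration (not part of the original comment), a query along these lines against the standard system.query_log schema can surface the heaviest statements in the midnight window; the time range and selected columns are assumptions to adjust for your setup:

-- heaviest queries in the first hour of the current day
SELECT
    query_start_time,
    query_duration_ms,
    read_rows,
    written_bytes,
    memory_usage,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_start_time >= toStartOfDay(now())
  AND query_start_time <  toStartOfDay(now()) + 3600
ORDER BY query_duration_ms DESC
LIMIT 20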

cache-ttl, AFAIR, is an internal mechanism to avoid re-inserting rows into the tags and index tables.

I rather suspect that some of your users are requesting a huge number of metrics from graphite, than that it's somehow related to carbon.

evdevk commented 1 year ago

I tried stopping graphite-clickhouse at 23:50 (keeping carbon-clickhouse running) and still got the same problem at around 00:01. I have no cron jobs on ClickHouse. The Graphite rollup period is larger than the data retention (rollup is 3 days, data TTL is 2 days), so no retention rules should be active. So, the next experiment will be stopping carbon-clickhouse.

What does the index table actually exist for in this software? Can I disable it if I'm not using Grafana in my setup?

msaf1980 commented 1 year ago

On each new day, carbon-clickhouse generates an insert into the index/tags tables. This is not really needed and might be refactored in the future. An expiring cache TTL also produces a new insert. So the no-daily-index option only saves disk space, it does not reduce the insert rate. It also produces high CPU usage until the parts are merged again (which is costly on huge index/tags tables).

What is the size / record count of your index/tags tables, and how many unique metrics go into the daily index/tags?
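For reference, a sketch of how those numbers could be pulled from ClickHouse, assuming the table names from the config above (data.graph_index, data.tags) and the Date/Path columns visible in the INSERT statements later in this thread:

-- on-disk size and row counts of the index/tags tables
SELECT
    table,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active AND database = 'data' AND table IN ('graph_index', 'tags')
GROUP BY table

-- unique metrics per day in the index table
SELECT Date, uniq(Path) AS uniq_metrics
FROM data.graph_index
WHERE Date >= today() - 7
GROUP BY Date
ORDER BY Date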

msaf1980 commented 1 year ago

Also, ClickHouse tries to use all available cores for background processes (like merges). Restricting them to a smaller value (than 54) may be a solution.
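A hedged sketch of how to see what the background work looks like before lowering the thread counts (the pool sizes themselves, e.g. background_pool_size, are server-level settings in config.xml, and their names vary between ClickHouse versions):

-- currently running merges
SELECT database, table, elapsed, progress, num_parts, memory_usage
FROM system.merges
ORDER BY elapsed DESC

-- background task metrics
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%'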

evdevk commented 1 year ago

I stopped carbon-clickhouse 5 minutes before 00:00 and there were zero load spikes. So this is definitely not graphite-clickhouse. Will check the ClickHouse background processes and try with the carbon-clickhouse daily index enabled. thx @msaf1980

mikezsin commented 4 months ago

At 00:00 it generates heavy inserts, roughly 4-5x larger than the normal size:

┌─query_duration_ms─┬─written_bytes─┬─http_user_agent────┬────query_start_time─┬─query──────────────────────────────────────────────────────────────────────────┐
│ 58927 │ 1412361385 │ Go-http-client/1.1 │ 2024-04-30 00:04:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59455 │ 1328363077 │ Go-http-client/1.1 │ 2024-04-30 00:05:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59805 │ 1497485418 │ Go-http-client/1.1 │ 2024-04-30 00:05:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 63204 │ 1412220635 │ Go-http-client/1.1 │ 2024-04-30 00:05:00 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60733 │ 1242829805 │ Go-http-client/1.1 │ 2024-04-30 00:05:06 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 59462 │ 1496872108 │ Go-http-client/1.1 │ 2024-04-30 00:05:59 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 58941 │ 1412365747 │ Go-http-client/1.1 │ 2024-04-30 00:06:01 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60223 │ 1497073175 │ Go-http-client/1.1 │ 2024-04-30 00:06:02 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
│ 60893 │ 1328369735 │ Go-http-client/1.1 │ 2024-04-30 00:06:03 │ INSERT INTO graphite.graphite_index (Date, Level, Path, Version) FORMAT RowBinary │
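To quantify that spike over time, a query of this shape against system.query_log (a sketch; the table name is taken from the output above) groups the insert volume into graphite.graphite_index per hour:

-- hourly INSERT volume into the index table over the last day
SELECT
    toStartOfHour(query_start_time) AS hour,
    count() AS inserts,
    formatReadableSize(sum(written_bytes)) AS written
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE 'INSERT INTO graphite.graphite_index%'
  AND event_date >= today() - 1
GROUP BY hour
ORDER BY hour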

spoofedpacket commented 2 months ago

I've also been observing this recently. We see a large spike in inserts which can last around an hour:

[image attached]

We hit the "too many parts" cap in ClickHouse and carbon-clickhouse queues metrics until it recovers:

[2024-06-17T14:47:16.153Z] ERROR [upload] handle failed {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1718634714840113874.lz4", "metrics": 1027490, "error": "clickhouse response status 500: Code: 252. DB::Exception: Too many parts (305). Merges are processing significantly slower than inserts. (TOO_MANY_PARTS) (version 22.8.15.25.altinitystable (altinity build))\n", "time": 1.125482143
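The "Too many parts (305)" message corresponds to the parts_to_throw_insert limit (300 by default). A sketch for watching how close tables get to it, assuming the graphite database name from the query_log output earlier in this thread:

-- active part counts per partition, highest first
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active AND database = 'graphite'
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10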

mikezsin commented 2 months ago

I found out why the LA spikes: they are caused by rollup aggregation rules being applied to too many metrics.
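For anyone hitting the same thing, a sketch of how to review those rules: the rollup configuration applied by GraphiteMergeTree during merges can be inspected via system.graphite_retentions, which helps spot regexps that match far more metrics than intended (column names follow the standard layout of that system table):

-- rollup patterns and retention steps known to the server
SELECT config_name, regexp, function, age, precision
FROM system.graphite_retentions
LIMIT 20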