jamessanford / remote-tsdb-clickhouse

A remote writer/reader for Prometheus that stores TSDB data in ClickHouse
Apache License 2.0

lower insert freq to 1 insert per second #11

Closed. 9268 closed this issue 2 months ago

9268 commented 3 months ago

DB create SQL:

 CREATE TABLE metrics.prometheus (
   `updated_at` DateTime CODEC(DoubleDelta, LZ4),
   `metric_name` LowCardinality(String),
   `labels` Array(LowCardinality(String)),
   `value` Float64 CODEC(Gorilla, LZ4),
   INDEX labelset (labels, metric_name) TYPE set(0) GRANULARITY 8192
 ) ENGINE = MergeTree
 PARTITION BY toDate(updated_at)
 ORDER BY (metric_name, labels, updated_at)
 SETTINGS index_granularity = 8192

I'm using the latest version and get an error when ClickHouse is under high load. The error message is as follows:

2024.07.04 01:07:41.747054 [ 26952 ] {bd208208-f10e-4a8f-b5fa-3ec4b21330fe} <Error> TCPHandler: Code: 252. DB::Exception: Too many parts (3015 with average size of 357.38 KiB) in table 'metrics.prometheus (fd111252-4bd1-4de4-835f-d3d28601e20b)'. Merges are processing significantly slower than inserts. (TOO_MANY_PARTS), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cbcedbb
1. DB::Exception::Exception<unsigned long&, ReadableSize, String>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity<ReadableSize>::type, std::type_identity<String>::type>, unsigned long&, ReadableSize&&, String&&) @ 0x0000000011d8f7da
2. DB::MergeTreeData::delayInsertOrThrowIfNeeded(Poco::Event*, std::shared_ptr<DB::Context const> const&, bool) const @ 0x0000000011d8f317
3. DB::runStep(std::function<void ()>, DB::ThreadStatus*, std::atomic<unsigned long>*) @ 0x000000001263fc9c
4. DB::ExceptionKeepingTransform::work() @ 0x000000001263f290
5. DB::ExecutionThreadContext::executeTask() @ 0x00000000123d04fa
6. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x00000000123c4a50
7. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x00000000123c4468
8. DB::PushingPipelineExecutor::start() @ 0x00000000123d7b40
9. DB::TCPHandler::runImpl() @ 0x000000001235380f
10. DB::TCPHandler::run() @ 0x000000001236d099
11. Poco::Net::TCPServerConnection::start() @ 0x0000000014c9bef2
12. Poco::Net::TCPServerDispatcher::run() @ 0x0000000014c9cd39
13. Poco::PooledThread::run() @ 0x0000000014d954a1
14. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000014d93a3d
15. start_thread @ 0x0000000000007dc5
16. __clone @ 0x00000000000f773d

I have no idea how to deal with this issue.

2024.07.04 01:54:28.447517 [ 59637 ] {a23b5dd6-9855-4278-bd1e-762804378154} <Information> metrics.prometheus (52836685-983e-40de-940a-7205693ca9c1): Delaying inserting block by 10 ms. because there are 1007 parts and their average size is 2.39 MiB
2024.07.04 01:54:28.448034 [ 36827 ] {47665ddd-7cbc-4632-991c-15444e4fbd41} <Information> metrics.prometheus (52836685-983e-40de-940a-7205693ca9c1): Delaying inserting block by 10 ms. because there are 1007 parts and their average size is 2.39 MiB

I checked and found a related issue on ClickHouse. Should we lower the insert frequency to 1 insert per second to avoid this? https://github.com/ClickHouse/ClickHouse/issues/3174
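For reference, queries along these lines against ClickHouse's standard system tables (table names taken from the schema above; the exact thresholds depend on your version and MergeTree settings) can show how close the table is to the delay/throw limits:

 -- active part count per partition for the table in question
 SELECT partition, count() AS active_parts
 FROM system.parts
 WHERE database = 'metrics' AND table = 'prometheus' AND active
 GROUP BY partition
 ORDER BY active_parts DESC;

 -- the thresholds behind "Delaying inserting block" and "Too many parts"
 SELECT name, value
 FROM system.merge_tree_settings
 WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');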

9268 commented 3 months ago

ClickHouse local version 24.3.2.23 (official build)

9268 commented 3 months ago

By increasing max_samples_per_send, I got a better insert frequency (see the attached graph).

remote_write:
 - url: http://:9131/write
   remote_timeout: 120s
   queue_config:
     max_samples_per_send: 50000
 - url: http://:9131/write
   remote_timeout: 120s
   queue_config:
     max_samples_per_send: 50000

jamessanford commented 2 months ago

@9268 Thanks for the report! Is your graph showing a decrease in rate(write_requests_total)? (same rate(samples_written_total) but fewer writes to clickhouse?)

I've opened pull request #12 to add a note to the README about your findings. I agree that adjusting capacity and max_samples_per_send as described in Remote Write Tuning is probably necessary in production environments.
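As a rough sketch (the endpoint and numbers below are illustrative, not tested recommendations), the relevant knobs sit under queue_config in the remote_write section:

 remote_write:
   - url: http://clickhouse-writer:9131/write   # placeholder endpoint
     queue_config:
       capacity: 20000             # samples buffered per shard before blocking
       max_samples_per_send: 10000 # larger batches mean fewer inserts hitting ClickHouse
       max_shards: 10              # cap on parallel senders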

Adjusting remote_timeout should not be necessary -- we should respond pretty quickly even if this does start happening.

Good to keep an eye on prometheus_remote_storage_samples_pending suddenly growing, or if prometheus_remote_storage_samples_failed_total starts showing completely dropped samples.
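
For example, a minimal alerting-rule sketch for those two series could look like this (threshold values are placeholders to tune for your own setup):

 groups:
   - name: remote-write-health
     rules:
       - alert: RemoteWriteBacklogGrowing
         # pending samples climbing for a sustained period means the backend is falling behind
         expr: prometheus_remote_storage_samples_pending > 100000
         for: 10m
       - alert: RemoteWriteDroppingSamples
         # any increase in failed samples means data is being dropped outright
         expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
         for: 5m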

jamessanford commented 2 months ago

Even on my hobby instance, adjusting max_samples_per_send shows nice benefits.

Some metrics to take a look at while adjusting:

rate(write_requests_total[5m])
rate(ClickHouseProfileEvents_MergedRows[30m])
ClickHouseMetrics_PartsCompact + ClickHouseMetrics_PartsWide

jamessanford commented 2 months ago

With the added notes in #12, I think we're good, thanks!