lapo-luchini opened this issue 3 years ago
Something is amiss… the report states it took 32.88 s, but time reports a wall time of 9'40".
% time http http://server:9500/metrics
…(all metrics)…
# Wrote 6449 metrics for 131 metric families in 32.88 s
http http://server:9500/metrics 0.36s user 0.05s system 0% cpu 9:39.43 total
99% of the calls time out, but the few that manage to go through gave me this:
sort_desc(sum(avg_over_time(cassandra_exporter_collection_time_seconds_total[6h])) by (collector)) > 0.5
I'm seeing the same problem: on one node the scrape was timing out, and it is slow on the other nodes. I enabled --enable-collector-timing, and the collector for cassandra_table_estimated_partitions is taking the most time by far:
This cluster has three tables, with about 6.1 million, 6.2 million, and 310k estimated partitions.
If I understand the code correctly, Cassandra estimates the number of partitions by looping over all memtables and SSTables, so I don't think the problem is caused by cassandra-exporter.
Relevant code:
https://github.com/instaclustr/cassandra-exporter/blob/master/common/src/main/java/com/zegelin/cassandra/exporter/FactoriesSupplier.java#L704
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/metrics/TableMetrics.java#L481-L493
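For context, the linked gauge does roughly the following on every read. This is a paraphrase from memory of the Cassandra source (method names may differ between versions), so see the TableMetrics.java link above for the exact code:

// Rough paraphrase of Cassandra's EstimatedPartitionCount gauge; not verbatim.
public Long getValue() {
    long memtablePartitions = 0;
    for (Memtable memtable : cfs.getTracker().getView().getAllMemtables())
        memtablePartitions += memtable.partitionCount();
    // An approximate key count is summed across every live SSTable, so the cost
    // of a single scrape grows with the SSTable count (e.g. during heavy compaction).
    return SSTableReader.getApproximateKeyCount(liveSSTables) + memtablePartitions;
}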
Yep, looks like estimated_partitions is another one where C* can end up performing a whole bunch of work behind the scenes to get the values. For now, perhaps blacklist the estimated_partitions metric using the --exclude option.
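For anyone else hitting this, the exclusion would look roughly like the line below when running the exporter as a Java agent. This is only a sketch: the jar path is a placeholder, and the exact agent option syntax and the value format that --exclude accepts (exported metric family name vs. MBean object-name pattern) should be verified against the exporter's README or --help output.

JVM_OPTS="$JVM_OPTS -javaagent:/path/to/cassandra-exporter-agent.jar=--exclude=cassandra_table_estimated_partitions"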
I'm hesitant to just enable caching for this metric, since that'll exhibit the same behaviour as #94. Might need a way to background-fetch expensive metrics and cache the result, rather than fetch them on scrape.
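To illustrate the background-fetch idea, here is a minimal, self-contained sketch of the pattern (not the exporter's actual collector code; ExpensiveGauge and its wiring are invented for the example): a scheduled task refreshes the expensive value off the scrape path, and a scrape simply reads whatever was cached last, trading bounded staleness for predictable scrape latency.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Minimal sketch of "background-fetch and cache" for an expensive metric.
// Not the exporter's real API; ExpensiveGauge and its wiring are hypothetical.
final class ExpensiveGauge {
    private final AtomicReference<Double> cached = new AtomicReference<>(Double.NaN);

    ExpensiveGauge(Supplier<Double> expensiveRead, ScheduledExecutorService scheduler, long periodSeconds) {
        // Refresh off the scrape path so a slow internal/JMX call cannot stall /metrics.
        scheduler.scheduleWithFixedDelay(() -> cached.set(expensiveRead.get()),
                0, periodSeconds, TimeUnit.SECONDS);
    }

    // Called on scrape: returns the last cached value immediately, however stale.
    double value() {
        return cached.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for e.g. the estimated-partitions read that walks every SSTable.
        Supplier<Double> slowRead = () -> {
            try { Thread.sleep(2000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            return 6_100_000.0;
        };
        ExpensiveGauge gauge = new ExpensiveGauge(slowRead, scheduler, 60);
        Thread.sleep(3000);                                  // let the first refresh finish
        System.out.println("scrape sees: " + gauge.value()); // fast read of the cached value
        scheduler.shutdownNow();
    }
}

The obvious trade-off is that scrapes can report values up to one refresh period old, so the refresh interval would need to be tied to the scrape interval to stay useful.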
Hi, I have a node acting as a disaster recovery DC… it doesn't have much memory or CPU (but of course it is fine as far as disk space goes), since little to no query traffic goes to it. During heavy compaction, though, I find that cassandra-exporter fails to finish its work in time, which causes scrape timeouts on the Prometheus side and broken pipes in the local logs.
The outcome seems very similar to issue #94, but I guess the causes are different, as I have zero snapshots at the moment.
I'm trying to add --enable-collector-timing to help debug the problem; I will report more later.