Strange, your points-per-update is just 3, which would imply that your queues are very small, unless you have some really long interval.

> I also thought the problem could be with the disk IO since it's always busy.

What does iostat say?
@piotr1212 Hi. Thanks for the quick reply. Our shortest interval is 5 minutes, and the maximum retention goes up to a year. Here's a small sample of iostat -x 1; a longer dump is here -> https://paste.debian.net/1082875/
The data disk is nvme2n1.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme2n1 0.00 0.00 2155.00 0.00 12684.00 0.00 11.77 1.77 0.81 0.81 0.00 0.45 96.40
avg-cpu: %user %nice %system %iowait %steal %idle
38.21 0.00 15.38 20.00 0.00 26.41
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme2n1 0.00 0.00 2052.00 0.00 11744.00 0.00 11.45 1.98 0.98 0.98 0.00 0.47 95.60
avg-cpu: %user %nice %system %iowait %steal %idle
47.92 0.00 27.86 8.33 0.00 15.89
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 3.00 0.00 80.00 0.00 53.33 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme2n1 0.00 0.00 1935.00 1435.00 11256.00 5740.00 10.09 87.83 26.06 0.91 59.98 0.27 91.60
avg-cpu: %user %nice %system %iowait %steal %idle
46.44 0.00 24.80 10.29 0.00 18.47
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme2n1 0.00 0.00 2212.00 0.00 12792.00 0.00 11.57 1.46 0.66 0.66 0.00 0.42 93.20
Your disk seems saturated (%util), but mostly due to reads rather than writes. Reads are needed for aggregation into the lower-resolution archives. Can you post your storage-schemas.conf? You might be able to reduce the number of archives so fewer reads are needed. Which version of carbon and whisper are you using?
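For reference, you can check how many archives a given metric actually has with whisper-info.py (the path below is only an illustration, not one from this setup):

whisper-info.py /opt/graphite/storage/whisper/data/some/metric.wsp

Every archive after the first has to be filled by aggregation, and each aggregation write first reads the covering points back out of the higher-resolution archive, which is where the read load comes from.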
@piotr1212 My carbon and whisper are both Version: 0.9.13. Also, here's my storage schema. I have at most 2 archives per metric.
[carbon]
pattern = ^carbon\.
retentions = 60s:90d
[data.def.retention]
pattern = ^data\.(.*)\.def\.
retentions = 900s:1d,3600s:365d
[data.ghi.retention]
pattern = ^data\.(.*)\.ghi\.
retentions = 3600s:1d,86400s:365d
[data.jkl.retention]
pattern = ^data\.(.*)\.jkl\.
retentions = 1800s:1d,86400s:365d
[data.mno.retention]
pattern = ^data\.(.*)\.disk\.mno\.
retentions = 1800s:1d,86400s:365d
[data.disk.usage.retention]
pattern = ^data\.(.*)\.disk\.root\.
retentions = 1800s:1d,86400s:365d
[data.disk.pqr.usage.retention]
pattern = ^data\.(.*)\.disk\.pqr\.
retentions = 86400s:365d
[data.disk.stu.usage.retention]
pattern = ^data\.(.*)\.disk\.stu\.
retentions = 86400s:365d
[data.vwx.retention]
pattern = ^data\.(.*)\.vwx\.
retentions = 900s:1d,3600s:365d
[data.net.retention]
pattern = ^data\.(.*)\.net\.eth0\.
retentions = 600s:1d,3600s:365d
[data.yz.count.retention]
pattern = ^data\.(.*)\.yz\.count\.
retentions = 1800s:1d,86400s:365d
[data.var.retention]
pattern = ^data\.(.*)\.var\.
retentions = 900s:1d,3600s:365d
[data.cba.retention]
pattern = ^data\.(.*)\.cba\.
retentions = 120s:1d
[data.fed.retention]
pattern = ^data\.(.*)\.fed\.
retentions = 300s:1d,900s:365d
[data.memory.retention]
pattern = ^data\.(.*)\.memory\.
retentions = 300s:1d,900s:365d
[data.ihg.retention]
pattern = ^data\.(.*)\.ihg\.
retentions = 600s:1d,3600s:365d
[data.lkj.retention]
pattern = ^data\.(.*)\.lkj\.
retentions = 1d:365d
[abc2]
pattern = ^abc2\.
retentions = 60s:1d,900s:365d
[abc]
pattern = ^abc\.
retentions = 3600s:1825d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,900s:365d
0.9.13 is old; there are some perf improvements in later versions with respect to page cache thrashing. I don't know your budget for disk space, but I'd just get rid of the second archive and make the first one larger, especially for the metrics which have less than 1 datapoint per 15 minutes. More RAM could also help, since reads would then come from the page cache instead of disk, but I have no clue how much more RAM you'd need.
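As an illustration, using one of the entries from the schema above, dropping the second archive just means carrying the first resolution through the whole retention period:

[data.def.retention]
pattern = ^data\.(.*)\.def\.
# before: retentions = 900s:1d,3600s:365d  (two archives; every update also triggers aggregation reads)
# after:  a single archive, so no aggregation at all
retentions = 900s:365d

The trade-off is disk space (roughly 35k points per metric at 900s for a year, versus about 8.9k with the two-archive layout). Also, the new layout only applies to newly created .wsp files; existing files would need to be rewritten with whisper-resize.py.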
Hi,
We're currently running a single carbon instance on an AWS c5.xlarge (4 cores, 8 GB) with the following config:
This runs fine. However, our use case is such that we do not want a queue, as it results in a lag of about 40-45 minutes before new graphs appear. And with increasing clients, and in turn increasing metrics, this lag will only grow.
In the last few days, I've tried a combination of relays and multiple-carbons-per-core setups, all on a single disk. None have worked.
I also thought the problem could be with the disk IO since it's always busy. To test this, I converted the AWS 3000 IOPS volume to an AWS 10000 IOPS volume and increased MAX_UPDATES_PER_SECOND to 50000, but this had no effect; it did not even dent the queue. Am I hitting some kind of soft limit with this config? If not, what steps can I take to decrease this queue?
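For reference, the setting in question is the MAX_UPDATES_PER_SECOND line in the [cache] section of carbon.conf; the change above amounts to roughly this (only the relevant line is shown, not our full config):

[cache]
# cap on whisper file updates per second issued by the writer thread
MAX_UPDATES_PER_SECOND = 50000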
We have a combination of metrics of varying intervals and retention schemes.
Here's our past 25 days of relevant data: https://imgur.com/Csc9dvx (The dip on the 23rd of May is when the service was stopped and the volume was converted to an AWS 10k IOPS volume.)