go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[BUG] Gaps in persisting metrics #389

Closed nickbabcock closed 3 years ago

nickbabcock commented 4 years ago

I'm running a small single node setup of grafana + carbonapi + go-graphite + collectd. I'm seeing gaps in the data as shown below.

[screenshot: Grafana graph showing the gaps in the data]

Sometimes it's one metric that is missing, sometimes multiple.

I've just switched from warn-level to debug-level logging, so I'll update this post when the next gap occurs. While logging at the warn level, there were no log entries around the time of the gap.

go-carbon.conf

```conf
[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 1

[whisper]
data-dir = "/ssd/whisper"
schemas-file = "storage-schemas.conf"
aggregation-file = "storage-aggregation.conf"
workers = 4
max-updates-per-second = 200
max-creates-per-second = 10
flock = true
enabled = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tags]
enabled = false

[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
max-globs = 100
fail-on-max-globs = false
max-metrics-globbed = 30000
max-metrics-rendered = 1000
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
enabled = false

[[logging]]
logger = ""
file = "/go-carbon.log"
level = "debug"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
```
storage-schemas.conf

```conf
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[data_for_years]
pattern = .*
retentions = 1m:21d,15m:5y
```
storage-aggregation.conf

```conf
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[total]
pattern = \.total-cpus$
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```

The gap in question is from the following query:

`alias(collectd.$node.aggregation-cpu-average.percent-*, 'cpu')`

Using the whisper dump, I can see the missing data, so it looks like there could be a persistence issue:

```
1m:21d 1464: 1604675580, 7.5593058054648
1m:21d 1465: 0,         0
1m:21d 1466: 1604675700, 7.06159863045981
```
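
(For anyone reproducing this: a dump like the one above can be produced with `whisper-dump.py` from the Graphite whisper package; the exact path below is only illustrative, based on the `data-dir` and the query above.)

```sh
# Dump the raw archive contents of a single whisper file (path is illustrative)
whisper-dump.py /ssd/whisper/collectd/<node>/aggregation-cpu-average/percent-idle.wsp
```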

Versions:

Additional info:

Happy to provide any other additional info.

deniszh commented 4 years ago

Hi @nickbabcock,

I very much doubt the problem lies in go-carbon, especially if the logs say nothing and flock = true. If you see a gap in the data, I would assume it really is a gap in the data, i.e. no data arrived in the specified interval, hence nothing was written to the file. If it's happening often, try recording the incoming traffic on port 2003 and checking whether the data really arrived during the gap. You can record traffic with `tcpdump -i <interface> -s 65535 -w <file>.cap port 2003` (beware, it can consume a lot of resources and disk space) and then analyze it with `ngrep -I <file>.cap '<metricname>' port 2003`.

Civil commented 4 years ago

My .02:

It's usually not a good idea to have the collector's interval set to 55s while the storage schema expects data every 60s; they should match. If I'm not missing anything and still remember how whisper and go-carbon work, that can result in a periodic gap, because a newly arrived point will overwrite the previous one instead of creating a new one (depending on how lucky you are with how the timestamps of arriving points align with what the storage schema expects).

deniszh commented 4 years ago

Right, I overlooked that. If the intervals don't match, gaps are indeed inevitable.

nickbabcock commented 4 years ago

Thanks for the feedback. A bit more clarification:

Since I didn't see this behavior with carbon-cache, I naturally gravitated towards a go-carbon issue after migrating to go-carbon.

But I will continue my investigation with your tips.

bom-d-van commented 4 years ago

Yeah, as @Civil said, in Graphite timestamps are modulo-ed down: https://github.com/go-graphite/go-whisper/blob/master/whisper.go#L1023

So sending a point every 55 seconds means a data point may look lost every once in a while, because it is modulo-ed down to an older timestamp.

To elaborate a bit more, take your minutely metric as an example: if the initial timestamp is 0, the next timestamp is 55, which is also modulo-ed down to 0, so that data point looks lost even though it was actually received and saved.
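
A minimal Go sketch of that alignment (simplified from the go-whisper line linked above; the timestamps are just illustrative):

```go
package main

import "fmt"

// align mimics the modulo-down step in go-whisper: a point's timestamp is
// snapped to the start of its retention interval before it is written.
func align(ts, secondsPerPoint int) int {
	return ts - (ts % secondsPerPoint)
}

func main() {
	// With a 1m retention, points sent at t=0 and t=55 land in the same slot,
	// so the second write overwrites the first instead of adding a new point.
	fmt.Println(align(0, 60), align(55, 60), align(110, 60)) // 0 0 60
}
```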

nickbabcock commented 4 years ago

Perhaps I'm missing something, so hopefully an example will clarify my confusion.

```
timestamp | stored time | val
        0 |           0 |   1
       55 |           0 |   2
      110 |          60 |   3
      165 |         120 |   4
      220 |         180 |   5
      275 |         240 |   6
```

In the above example there is no stored time that would have a gap, right? Yes, the value of 1 gets overwritten, but that is fine for my use case (for me, an overwritten value is much better than a timestamp with no value).

I'll switch back to interval 60 and continue testing. As you can imagine, the gap issue is intermittent, and I can't seem to correlate it with anything right now.

bom-d-van commented 4 years ago

Yeah, you are right. If the send interval is larger than 60s there will be gaps.
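
To illustrate (not the real go-carbon write path, just a toy simulation of the alignment discussed above; the 15-minute window is arbitrary):

```go
package main

import "fmt"

func main() {
	const secondsPerPoint = 60 // 1m retention

	for _, sendEvery := range []int{55, 65} {
		written := map[int]bool{}
		// Simulate a producer sending one point every sendEvery seconds.
		for ts := 0; ts < 900; ts += sendEvery {
			written[ts-(ts%secondsPerPoint)] = true // modulo-down alignment
		}
		// Any 60s slot that never received a write shows up as a gap.
		var gaps []int
		for slot := 0; slot < 900; slot += secondsPerPoint {
			if !written[slot] {
				gaps = append(gaps, slot)
			}
		}
		fmt.Printf("send every %2ds: empty slots %v\n", sendEvery, gaps)
	}
}
```

With a 55s interval every slot gets at least one write (some get two, i.e. an overwrite); with a 65s interval the slot at t=720 is skipped entirely, which shows up as a gap.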

For debugging purposes, maybe also log the data points to a temporary file from your cron job, to make sure there is no problem on the producer side.

Also, could there be an NTP or networking issue in your environment?

Oh, is this an issue just happening in v0.15+?

deniszh commented 4 years ago

I would also recommend switching the write_graphite plugin in collectd from UDP to TCP.
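
For reference, the relevant collectd block would look roughly like this (the node name is a placeholder; the host and port match the setup described in this thread):

```conf
<Plugin write_graphite>
  <Node "go-carbon">
    Host "head"
    Port "2003"
    # use TCP instead of UDP
    Protocol "tcp"
    LogSendErrors true
    Prefix "collectd."
  </Node>
</Plugin>
```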

nickbabcock commented 4 years ago

Status update. Below is the collectd conf file that I'm using (updated to use interval 60.0 from our discussions here) and I'm still experiencing gaps on 0.15.5.

collectd.conf

```conf
Interval 60.0
FQDNLookup false

LoadPlugin cpu
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin df
LoadPlugin swap
LoadPlugin write_graphite
LoadPlugin ipmi
LoadPlugin aggregation

ReportByCpu true
ReportByState false
ValuesPercentage true

ValuesAbsolute true
ValuesPercentage true

Host "head"
Port "2003"
Protocol "tcp"
LogSendErrors true
Prefix "collectd."
StoreRates true
AlwaysAppendDS false
EscapeCharacter "_"
SeparateInstances false

# ignore rootfs; else, the root file-system would appear twice, causing
# one of the updates to fail and spam the log
FSType rootfs
# ignore the usual virtual / temporary file-systems
FSType sysfs
FSType proc
FSType devtmpfs
FSType devpts
FSType tmpfs
FSType fusectl
FSType cgroup
IgnoreSelected true

ValuesPercentage true
ReportBytes true

# Aggregate the individual CPU data into a group level metric. We don't want
# the node dashboard to need to query 20 different metrics (representing 20
# cores) to figure out what the average is
Plugin "cpu"
Type "percent"
GroupBy "Host"
GroupBy "TypeInstance"
CalculateSum true
CalculateAverage true
```

The debug log around the time of the gap is not exciting. Filled with lines like:

```
[2020-11-09T21:39:24.403Z] DEBUG [cache] WriteoutQueue sleep before rebuild {"sleepTime": "99.9773ms"}
[2020-11-09T21:39:24.503Z] DEBUG [cache] WriteoutQueue.nextRebuildOnce.Do {"now": "2020-11-09 21:39:24.5038352 +0000 UTC m=+25654.633910301", "next": "2020-11-09 21:39:24.6038137 +0000 UTC m=+25654.733890001"}
```

Perhaps it is a collectd issue. It's standard collectd 5.8 on CentOS 7. I'll trawl the collectd logs and see what I find. I'll also start outputting collectd metrics to a file.

The good news is that the cron script was user error 🙄

EDIT: It may legitimately be a collectd issue (https://github.com/collectd/collectd/issues/1015). I've seen the issue on other series, but that was prior to 0.15.5. I'll report back when I have more info.

nickbabcock commented 3 years ago

I'm going to close this as there has been no discernible evidence that points to a bug in go-carbon.

I appreciate the patience and debugging tips.

elq81hc commented 1 year ago

I'm facing the same issue here:

[screenshot: graph showing gaps]

My graphite node is deployed on an AWS instance and go-carbon listens on port 2003. In the image above, data is written to port 2003 every 10 seconds. I captured the network traffic from the graphite node and everything looks fine. I suspect the issue is caused by go-carbon itself, but I can't find anything in the logs, and not all metrics show this issue.