Hi @nickbabcock,
I very much doubt the problem lies in go-carbon, especially if the logs say nothing and `flock = true`.
If you see a gap, I would assume it's a real gap in the data - i.e. no data arrived during that interval, hence nothing was inserted into the file.
If it's happening often, try recording the incoming traffic on port 2003 and checking whether data really arrived during the gap.
You can record traffic with `tcpdump -i <interface> -s 65535 -w <file>.cap port 2003`
(beware, it can consume a lot of resources and disk space) and then analyze it with `ngrep -I <file>.cap '<metricname>' port 2003`.
My .02:
It's usually not a good idea to have the collector's interval set to 55s while the storage schema expects data every 60s. They should match. If I'm not missing anything and still remember how whisper and go-carbon work, the mismatch might result in a periodic gap, as a newly arrived point will overwrite the previous one instead of creating a new one (depending on how lucky you are with how the arrived point's timestamp aligns with what the storage schema expects).
Right, I overlooked that. If the intervals don't match, gaps are inevitable, indeed.
Thanks for the feedback. A bit more clarification:
echo "local.random.diceroll 4 $(date +%s)" >> /dev/tcp/localhost/2003
that showed the same gaps with 0.15.0 (unsure if those gaps will occur on 0.15.5)Since I didn't see this behavior with carbon-cache, I naturally gravitated towards a go-carbon issue after migrating to go-carbon.
But I will continue my investigation with your tips
Yeah, as @Civil said, in Graphite timestamps are modulo'd down: https://github.com/go-graphite/go-whisper/blob/master/whisper.go#L1023
So sending a point every 55 seconds means you may lose a data point every once in a while, because it gets modulo'd to an old timestamp.
To elaborate a bit more, take your minutely metric as an example: if the initial timestamp starts at 0, then the next timestamp is at 55, which is also modulo'd to 0. The data point therefore looks lost, but it's actually received and saved (it overwrites the previous one).
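A minimal sketch of that alignment (illustrative only; the real logic lives at the linked line in go-whisper, this just reproduces the `ts - ts%step` snapping for a 60-second archive):

```go
package main

import "fmt"

// alignTimestamp snaps a point's timestamp down to the start of its storage
// interval, i.e. ts - ts%step. Illustrative sketch, not go-carbon code.
func alignTimestamp(ts, step int) int {
	return ts - ts%step
}

func main() {
	step := 60 // storage-schemas.conf has a 1m resolution
	for ts := 0; ts <= 275; ts += 55 {
		fmt.Printf("sent at %3d -> stored at %3d\n", ts, alignTimestamp(ts, step))
	}
	// sent at   0 -> stored at   0
	// sent at  55 -> stored at   0   <- same slot, overwrites the previous point
	// sent at 110 -> stored at  60
	// sent at 165 -> stored at 120
	// sent at 220 -> stored at 180
	// sent at 275 -> stored at 240
}
```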
Perhaps I'm missing something, so hopefully an example will clarify my confusion.
| timestamp | stored time | val |
| --- | --- | --- |
| 0 | 0 | 1 |
| 55 | 0 | 2 |
| 110 | 60 | 3 |
| 165 | 120 | 4 |
| 220 | 180 | 5 |
| 275 | 240 | 6 |
In the above example there is no stored time that has a gap, right? Yes, the value of 1 gets overwritten, but that is fine for my use case (for me, an overwritten value is much better than a timestamp with no value).
I'll switch back to interval 60 and continue testing. As can be imagined, the gap issue is intermittent, and I can't seem to correlate it with anything right now.
Yeah, you are right. If the send interval is larger than 60 there will be gaps.
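To make the difference concrete, here is a small, hypothetical simulation (not go-carbon code) that counts how many 60-second slots stay empty over an hour for a 55-second versus a 65-second send interval:

```go
package main

import "fmt"

// countEmptySlots sends a point every sendInterval seconds for duration
// seconds into step-sized slots (using the same ts - ts%step alignment) and
// counts the slots that never receive a point. Illustrative sketch only.
func countEmptySlots(sendInterval, step, duration int) int {
	filled := make(map[int]bool)
	for ts := 0; ts < duration; ts += sendInterval {
		filled[ts-ts%step] = true
	}
	empty := 0
	for slot := 0; slot < duration; slot += step {
		if !filled[slot] {
			empty++
		}
	}
	return empty
}

func main() {
	const step, hour = 60, 3600
	fmt.Println("55s interval, empty slots:", countEmptySlots(55, step, hour)) // 0 - only overwrites
	fmt.Println("65s interval, empty slots:", countEmptySlots(65, step, hour)) // > 0 - visible gaps
}
```

With an interval below the 60s step, the worst case is an occasional overwrite; above it, some slots never receive a point and show up as gaps.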
For debugging purposes, maybe also log the data points to a temporary file from your cron job, to make sure there is no problem in your producer.
Also, could there be an NTP or networking issue in your environment?
Oh, is this an issue just happening in v0.15+?
I would also recommend switching the write_graphite plugin in collectd to TCP instead of UDP.
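For reference, a sketch of what that looks like in collectd's write_graphite configuration (option names as documented for collectd 5.x; the node name and host/port values are assumptions for this single-node setup):

```conf
<Plugin write_graphite>
  <Node "go-carbon">
    Host "localhost"
    Port "2003"
    # Use TCP instead of UDP so dropped packets don't silently become gaps
    Protocol "tcp"
    LogSendErrors true
  </Node>
</Plugin>
```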
Status update. Below is the collectd conf file that I'm using (updated to use interval 60.0 from our discussions here) and I'm still experiencing gaps on 0.15.5.
The debug log around the time of the gap is not exciting. Filled with lines like:
```
[2020-11-09T21:39:24.403Z] DEBUG [cache] WriteoutQueue sleep before rebuild {"sleepTime": "99.9773ms"}
[2020-11-09T21:39:24.503Z] DEBUG [cache] WriteoutQueue.nextRebuildOnce.Do {"now": "2020-11-09 21:39:24.5038352 +0000 UTC m=+25654.633910301", "next": "2020-11-09 21:39:24.6038137 +0000 UTC m=+25654.733890001"}
```
Perhaps it is a collectd issue. It's standard collectd 5.8 on CentOS 7. I'll trawl the collectd logs and see what I find. I'll also start outputting collectd metrics to a file.
The good news is that the cron script was user error 🙄
EDIT: so it may legitimately be a collectd issue (https://github.com/collectd/collectd/issues/1015). I've seen the issue on other series (but this was prior to 0.15.5). I'll report back when I have more info.
I'm going to close this as there has been no discernible evidence that points to a bug in go-carbon.
I appreciate the patience and debugging tips.
I'm facing the same issue here. My graphite node is deployed on an AWS instance and go-carbon listens on port 2003. In the image above, data is written to port 2003 every 10 seconds. I tried capturing the traffic on the graphite node and everything looks fine. I suspect the issue is caused by go-carbon itself, but I can't find anything in the log, and not all metrics show this issue.
I'm running a small single node setup of grafana + carbonapi + go-graphite + collectd. I'm seeing gaps in the data as shown below.
Sometimes it's one metric that is missing, sometimes multiple.
I've just switched to logging at the debug level instead of `warn`, so I'll update the post when the next gap occurs; while logging at the `warn` level there were no logs around the time of the gap.

go-carbon.conf
```conf
[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 1

[whisper]
data-dir = "/ssd/whisper"
schemas-file = "storage-schemas.conf"
aggregation-file = "storage-aggregation.conf"
workers = 4
max-updates-per-second = 200
max-creates-per-second = 10
flock = true
enabled = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tags]
enabled = false

[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
max-globs = 100
fail-on-max-globs = false
max-metrics-globbed = 30000
max-metrics-rendered = 1000
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
enabled = false

[[logging]]
logger = ""
file = "/go-carbon.log"
level = "debug"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
```

storage-schemas.conf
```conf
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[data_for_years]
pattern = .*
retentions = 1m:21d,15m:5y
```

storage-aggregation.conf
```conf
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[total]
pattern = \.total-cpus$
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```

The gap in question is from the following query:
Using the whisper dump I can see missing data, so it looks like there could be a persistence issue.
Versions:
Additional info:
Happy to provide any other additional info.