Closed: SJrX closed this issue 3 years ago.
I should also mention that I couldn't find an easy way to dump the contents of the whisper file, but I did check its md5sum, and it is changing every 30 seconds.
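For what it's worth, the whisper package ships CLI helpers like whisper-dump.py and whisper-fetch.py, and the same Python API carbon writes with can dump a file directly. A minimal sketch, assuming the container's default /opt/graphite storage layout:

```python
# Minimal sketch: read a whisper file with the whisper Python API
# (the same library carbon writes with). The path is an assumption
# based on the default /opt/graphite storage layout in the container.
import time
import whisper

path = ("/opt/graphite/storage/whisper/stats/gauges/qa/end_to_end_tests/"
        "main_branch/staging_live/test_run/totals/scenario_results/total/last.wsp")

# Archive layout: retentions, xFilesFactor, aggregation method.
print(whisper.info(path))

# Fetch the last hour of points; gaps come back as None, not 0.
(start, end, step), values = whisper.fetch(path, int(time.time()) - 3600)
for ts, value in zip(range(start, end, step), values):
    print(ts, value)
```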
Hi @SJrX
The logs above don't say anything about reading this data. What do you mean when you say "metric can't be read"? Is it not visible in the UI? Does it show "No data"? Does it show an empty graph? What about other metrics, are they visible? Are the visible metrics also using statsd, or are they coming from carbon directly? If other statsd metrics are fine and just a single metric drops to 0, I would say it's really 0, i.e. check your metric source. If all statsd metrics dropped to 0, maybe that's a statsd issue. I don't have much experience with statsd; maybe the process died or stopped working.
The logs above don't say anything about reading this data.
I assume this is a read call; it only occurs when I try to read the metric.
23/08/2021 17:35:26 :: [query] [127.0.0.1:52882] cache query for "stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last" returned 0 values
What do you mean when you say "metric can't be read"? Is it not visible in the UI? Does it show "No data"? Does it show an empty graph?
I mean that I am getting zero values for everything.
What about other metrics, are they visible? Are the visible metrics also using statsd, or are they coming from carbon directly?
A lot of metrics are displaying values correctly, but it's hard to find out which ones are not. Looking at stats.gauges.qa.end_to_end_tests, there are two options under this, main_branch or merge_request, and then the environment name comes after. These are just strings passed in via the CI job, so the metric reports values for merge_request.staging_live but not main_branch.staging_live.
Are the visible metrics also using statsd, or are they coming from carbon directly?
90% of the metrics that I am looking at are coming from statsd; the rest are from collectd straight to graphite, plus the internal carbon metrics.
If other statsd metrics are fine and just a single metric drops to 0, I would say it's really 0, i.e. check your metric source.
The metric source reports a higher value, and statsd internally also says a higher value.
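A quick way to cross-check what statsd itself is holding is its management interface; a hedged sketch, assuming the container exposes the default statsd admin port 8126:

```python
# Sketch: ask statsd's management interface for its current gauge
# values, to separate "client sent zero" from "carbon wrote zero".
# Host and port (TCP 8126 is the statsd default) are assumptions.
import socket

sock = socket.create_connection(("localhost", 8126), timeout=5)
sock.sendall(b"gauges\n")
print(sock.recv(65536).decode())
sock.close()
```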
If all statsd metrics dropped to 0, maybe that's a statsd issue. I don't have much experience with statsd; maybe the process died or stopped working.
Nope, it's just a handful of metrics that mysteriously dropped to zero. Let me try writing to graphite directly and see what it says, and let me also grab a tcpdump.
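For the direct write, carbon's plaintext protocol is just `name value timestamp` per line on TCP 2003, so a test datapoint can be pushed with a few lines like the sketch below (host and metric name are placeholders); for the capture, something like `tcpdump -i any -A udp port 8125` should show the raw statsd lines on the wire.

```python
# Sketch: push one test datapoint straight to carbon's plaintext
# listener (TCP 2003 by default), bypassing statsd entirely.
# The metric name is made up for the test.
import socket
import time

line = "test.debug.direct_write 180 %d\n" % int(time.time())
with socket.create_connection(("localhost", 2003)) as sock:
    sock.sendall(line.encode("ascii"))
```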
I apologize; I was binary searching and could have sworn that statsd had the right value for the metric, but looking again I did see it was sending zero, and that log message I saw in carbon probably doesn't mean what I think it does. I just assumed that when it said it returned 0 values, it meant there was no data.
Metrics written to graphite directly did display correctly.
I'm having an issue where some metrics seem to be written successfully but then can't be read. I'm using the graphiteapp/graphite-statsd container, version 1.1.8-1.
One such metric is stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last. statsd is writing this metric every 30 seconds, and as far as statsd is concerned the value should be about 180 (although I didn't check the wire with tcpdump). When I enable logging of updates and creates, I see the following again and again:
I checked if the whisper files were corrupt, but the tool didn't say anything.
Now the weird thing is that this was working; it looks like data just stopped being written yesterday, as the value drops to zero. There is at least one other metric that dropped to zero at the same time yesterday.
The service is running in Docker on an EC2 instance with about 6000 IOPS provisioned. We are using this to track rarely generated metrics (i.e., test results that take 10 minutes to run and may run only every 4 hours), but we generate details about every test, so a test run might burst 12000 metrics at once to statsd. statsd is set to delete counters, so only gauges should be regularly written to. I did have an issue where this wasn't the case, and so we had a ton of IOPS until I fixed it last week. The IOPS seem to peak at around 1000 right now. The collectd.graphite.filecount-whisper.files metric is reporting 303 K metrics (though most will rarely be written to). I am using sparse files to avoid disk space issues (although I'm not sure if that will actually help) and to speed up creation. As far as my monitoring shows, there are no blacklist or whitelist issues, no dropped creates, points per update is , and the current committedPoints vs. metrics received value is the same (8.61 K). I'm still a bit foggy on the architecture of carbon, but I should just be using carbon-cache and not the aggregator or relay.
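On the sparse-files point, whether sparse creation actually saves space can be checked by comparing a file's apparent size with the blocks allocated on disk; a small sketch, with a placeholder path:

```python
# Sketch: check whether a whisper file is actually sparse. On Linux,
# st_blocks counts 512-byte blocks actually allocated on disk.
import os

st = os.stat("/opt/graphite/storage/whisper/some/metric.wsp")  # placeholder path
apparent = st.st_size         # logical file size
on_disk = st.st_blocks * 512  # bytes actually allocated
print(f"apparent={apparent}B on_disk={on_disk}B sparse={on_disk < apparent}")
```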
Any thoughts on how to debug this?