graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

[Q] Metrics Written But Not Queriable #923

Closed · SJrX closed this issue 3 years ago

SJrX commented 3 years ago

I'm having an issue where some metrics seem to be written but can't be read back. I'm using the graphiteapp/graphite-statsd container, version 1.1.8-1.

One such metric is stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last. statsd writes this metric every 30 seconds, and as far as statsd is concerned the value should be about 180 (although I didn't check the wire with tcpdump).
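As a sanity check on the statsd side (short of tcpdump), one can push a known gauge value into statsd by hand and see whether it shows up after the next flush. A minimal sketch in Python, assuming statsd's default UDP port 8125 on localhost and a made-up metric name:

import socket

STATSD_HOST = "127.0.0.1"  # assumption: statsd runs alongside carbon in the same container
STATSD_PORT = 8125         # assumption: statsd's default UDP listener port

metric = "qa.end_to_end_tests.debug_probe"  # hypothetical name; would appear under stats.gauges.qa.end_to_end_tests.debug_probe
value = 180

# statsd gauge line format: "name:value|g", sent over UDP
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(f"{metric}:{value}|g".encode("ascii"), (STATSD_HOST, STATSD_PORT))
sock.close()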

When I enable logging of updates and creates, I see the following again and again:

23/08/2021 17:34:49 :: [updates] wrote 1 datapoints for stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last in 0.00014 seconds
23/08/2021 17:35:04 :: [query] [127.0.0.1:52868] cache query for "stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last" returned 0 values
23/08/2021 17:35:19 :: [updates] wrote 1 datapoints for stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last in 0.00007 seconds
23/08/2021 17:35:26 :: [query] [127.0.0.1:52882] cache query for "stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last" returned 0 values
23/08/2021 17:35:48 :: [updates] wrote 1 datapoints for stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last in 0.00007 seconds
23/08/2021 17:36:17 :: [updates] wrote 1 datapoints for stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last in 0.00005 seconds

I checked whether the whisper files were corrupt, but the tool didn't report anything.
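For reference, one way to inspect a whisper file's header without a separate tool is the whisper Python library that carbon itself writes with. A minimal sketch, using a hypothetical path for the container's whisper storage directory:

import whisper  # graphite's whisper library (pip install whisper)

# Hypothetical path; adjust to where the container actually stores .wsp files.
path = ("/opt/graphite/storage/whisper/stats/gauges/qa/end_to_end_tests/"
        "main_branch/staging_live/test_run/totals/scenario_results/total/last.wsp")

# whisper.info raises whisper.CorruptWhisperFile if the header cannot be parsed.
info = whisper.info(path)
print("aggregation:", info["aggregationMethod"], "xff:", info["xFilesFactor"])
for archive in info["archives"]:
    print("archive:", archive["secondsPerPoint"], "s/point,", archive["points"], "points")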

Now the weird thing is that this was working; it looks like data just stopped being written yesterday, when the value for this metric dropped to zero. There is at least one other metric that dropped to zero at the same time yesterday.

The service is running in Docker on an EC2 instance with about 6000 IOPS provisioned. We are using this to track rarely generated metrics (i.e., test results for runs that take 10 minutes and may happen only every 4 hours), but we generate details about every test, so a test run might burst 12000 metrics at once to statsd. statsd is set to delete counters, so only gauges should be written to regularly. I did have an issue where this wasn't the case, so we had a ton of IOPS until I fixed it last week; IOPS seem to peak at around 1000 right now. The collectd.graphite.filecount-whisper.files metric is reporting 303 K metrics (although most will rarely be written to). I am using sparse files to avoid disk space issues (although I'm not sure that actually helps) and to speed up file creation. As far as my monitoring shows, there are no blacklist or whitelist issues, no dropped creates, no errors, points per update is , and the current committedPoints vs metrics received values are the same (8.61 K).

I'm still a bit foggy on the architecture of carbon, but I should just be using carbon-cache and not the aggregator or relay.

# cat storage-schemas.conf
# Probably redundant
[carbon]
pattern = ^carbon\.
retentions = 30s:60d

[default]
pattern = .*
retentions = 30s:60d

# grep -v "^#" carbon.conf  | grep -v "^$"
[cache]
DATABASE = whisper
ENABLE_LOGROTATION = True
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 288000
MIN_TIMESTAMP_RESOLUTION = 30
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = False
LOG_UPDATES = True
LOG_CREATES = True
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = False
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_SPARSE_CREATE = True
WHISPER_FALLOCATE_CREATE = False
CARBON_METRIC_INTERVAL = 30
GRAPHITE_URL = http://127.0.0.1:8080
ENABLE_TAGS = False
LOG_LISTENER_CONN_SUCCESS = False

Any thoughts on how to debug this?

SJrX commented 3 years ago

I should also mention that I couldn't find an easy way to dump the contents of the whisper file, but I did check its md5sum, and it is changing every 30 seconds.
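In case it helps anyone else, the whisper library can also dump a window of raw datapoints directly from the file. A minimal sketch, assuming the same hypothetical .wsp path as above and the last hour of data:

import time
import whisper  # graphite's whisper library

# Hypothetical path; adjust to the real storage location.
path = ("/opt/graphite/storage/whisper/stats/gauges/qa/end_to_end_tests/"
        "main_branch/staging_live/test_run/totals/scenario_results/total/last.wsp")

# Fetch the last hour of datapoints straight from the file (bypassing carbon's cache).
(start, end, step), values = whisper.fetch(path, int(time.time()) - 3600)
for ts, value in zip(range(start, end, step), values):
    print(ts, value)  # value is None for slots that were never written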

deniszh commented 3 years ago

Hi @SJrX

The logs above don't say anything about reading this data. What do you mean when you say the metric can't be read? Is it not visible in the UI? Does it show "No data"? Does it show an empty graph? What about other metrics, are they visible? Are the visible metrics also coming through statsd, or do they go to carbon directly? If the other statsd metrics are fine and just a single metric is visible but dropped to 0, I would say it's really 0 - i.e. check your metric source. If all statsd metrics dropped to 0, maybe that's a statsd issue. I don't have much experience with statsd; maybe the process died or stopped working.

SJrX commented 3 years ago

The logs above don't say anything about reading this data.

I assume this is a read call; it only occurs when I try to read the metric.

23/08/2021 17:35:26 :: [query] [127.0.0.1:52882] cache query for "stats.gauges.qa.end_to_end_tests.main_branch.staging_live.test_run.totals.scenario_results.total.last" returned 0 values

What do you mean when you say the metric can't be read? Is it not visible in the UI? Does it show "No data"? Does it show an empty graph?

I mean that I am getting zeros as the values for everything.

What about other metrics, are they visible? Are the visible metrics also coming through statsd, or do they go to carbon directly?

A lot of metrics are displaying values correctly, but it's hard to tell which ones are not. Looking at stats.gauges.qa.end_to_end_tests, there are two options under it, main_branch or merge_request, and then the environment name comes after. These are just strings passed in via the CI job, so the metric reports values for merge_request.staging_live but not main_branch.staging_live.

Are the visible metrics also coming through statsd, or do they go to carbon directly?

90% of the metrics that I am looking at come from statsd; the rest come from collectd straight to graphite, plus the internal carbon metrics.

If the other statsd metrics are fine and just a single metric is visible but dropped to 0, I would say it's really 0 - i.e. check your metric source.

The metric source shows a higher value, and statsd internally also reports a higher value.

If all statsd metrics dropped to 0, maybe that's a statsd issue. I don't have much experience with statsd; maybe the process died or stopped working.

Nope, it's just a handful of metrics that mysteriously dropped to zero. Let me try writing to graphite directly and see what happens, and let me also grab a tcpdump.

SJrX commented 3 years ago

I apologize; I was binary searching and could have sworn that statsd had the right value for the metric, but looking again I see it was sending zero, and that log message I saw in carbon probably doesn't mean what I thought it did. I just assumed that when it said "returned 0 values", it meant there was no data, rather than that the in-memory cache simply had zero pending datapoints for the metric.

Metrics written to graphite directly did display correctly.
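For anyone hitting this later, a minimal sketch of what writing to graphite directly can look like, assuming carbon's plaintext line receiver on port 2003 from the carbon.conf above and a made-up metric name:

import socket
import time

CARBON_HOST = "127.0.0.1"  # assumption: carbon-cache reachable on localhost
CARBON_PORT = 2003         # LINE_RECEIVER_PORT from the config above

metric = "qa.end_to_end_tests.direct_write_probe"  # hypothetical metric name
value = 180

# Plaintext protocol: one "metric_path value timestamp" line per datapoint.
line = f"{metric} {value} {int(time.time())}\n"
with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
    sock.sendall(line.encode("ascii"))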