graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

frontend not finding some metrics in carbon cache. How to determine whether this is a problem with carbon-cache or if the metrics are stuck in a relay? #782

Closed mwtzzz-zz closed 6 years ago

mwtzzz-zz commented 6 years ago

webapp is supposed to retrieve and display metrics that are in carbon cache's memory before they have been written to disk (in addition to the metrics that are already on disk).

I can verify this is working as expected for some of my metrics by enabling LOG_CACHE_HITS, looking at the logs, and seeing "query returned" lines with 1+ results when I refresh my browser. The timestamp on disk is older, so I know the webapp is successfully getting values from both disk and the cache.

But I have other metrics that are not returning any recent values until they have been flushed to disk. In some cases, this means waiting 30+ minutes. During this wait, the log shows "query returned 0 results" when I refresh my browser. Data is not lost; eventually it makes it to disk, but in the meantime the browser shows an absence of recent data.
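For clarity, the expected read path can be sketched roughly like this (an illustrative toy, not graphite-web's actual code): recent points still held in carbon-cache's memory should overlay the older points read from the .wsp file on disk.

```python
# Minimal sketch of the expected read path (illustrative only, not
# graphite-web's real implementation). Datapoints are (timestamp, value)
# pairs; not-yet-flushed cache points should win over on-disk points.

def merge_disk_and_cache(disk_points, cache_points):
    """Merge (timestamp, value) pairs, with cache winning on conflicts."""
    merged = dict(disk_points)         # older points from the .wsp file
    merged.update(dict(cache_points))  # recent points from carbon-cache
    return sorted(merged.items())

disk = [(100, 1.0), (160, 2.0)]    # already flushed to disk
cache = [(220, 3.0), (280, 4.0)]   # still in carbon-cache memory
print(merge_disk_and_cache(disk, cache))
# -> [(100, 1.0), (160, 2.0), (220, 3.0), (280, 4.0)]
```

The symptom described above is the cache half of this merge coming back empty, so only the stale on-disk points are rendered.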

I've done some testing and I don't know if this is because:

(a) the metrics are in carbon-cache memory but carbon-cache cannot find them, or
(b) the metrics are stuck on the relay and have not made it to carbon-cache yet.

How do I figure it out? And what can be done to fix it?

mwtzzz-zz commented 6 years ago

Update: I'm pretty sure (but not positive) the problem is with carbon-cache. I did a test where I turned off the relays and then refreshed the browser. The idea was that by turning off the relays, the data would either get flushed to carbon-cache or would disappear with the relay.

The result was that still no recent data was showing up in the browser, until about 10 minutes later when it was written to disk. This tells me the data is being held by carbon-cache but not returned when the front-end requests it.

Is there a way to log when carbon-cache receives a new metric from the line? This way we would know for sure it is receiving the metric.

More importantly, why would it not return metrics that it has in its cache? Maybe there is an internal queue limit being hit? I'm running six instances that are processing about 2 million updates/minute. If I increase this number from six to eight or ten, would that help with this issue?

piotr1212 commented 6 years ago

Are these "new" metrics for which a .wsp file does not exist yet? Are you querying the right cache?

deniszh commented 6 years ago

@mwtzzz

webapp is supposed to retrieve and display metrics that are in carbon cache's memory before they have been written to disk (in addition to the metrics that are already on disk).

Unfortunately, that's not true. It's one of the oldest bugs which still exist - https://github.com/graphite-project/graphite-web/issues/629

mwtzzz-zz commented 6 years ago

@piotr1212 These are existing metrics that already have a .wsp file and I'm definitely querying the right cache.

@deniszh Well, shoot.... It sounds like adding cache instances can help alleviate the problem, though? Some of my metric values do get displayed even when they haven't made it to disk yet, so it works for some of them. Does this mean the problem is with the size of the data, or with the hash of the metric name?

I should note that we're not using wildcards.

deniszh commented 6 years ago

@mwtzzz

Some of my metrics values do get displayed even when they haven't made it to disk yet, so it works for some of them.

I never saw this behavior TBH.

mwtzzz-zz commented 6 years ago

@deniszh Can you help me troubleshoot and fix it?

mwtzzz-zz commented 6 years ago

@deniszh Here is an example of a metric that is not giving me trouble: atlantic_exchange.request.sum_all.hosts

Here is an example of a metric that is giving me trouble: atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts

deniszh commented 6 years ago

@mwtzzz : please check @piotr1212's comment - https://github.com/graphite-project/graphite-web/issues/629#issuecomment-396159341 Looks like you're facing a different issue...

mwtzzz-zz commented 6 years ago

@deniszh Yes, it looks like a different problem, because mine involves .wsp files that already exist. Can you help me troubleshoot? I'm stuck; I'm not sure what else to do.

piotr1212 commented 6 years ago

What is your setup? Running a relay? Hashing? If so, make sure that the carbonlink destinations in the carbon config are literally equal to CARBONLINK_HOSTS in local_settings.py. No exceptions in query.log, console.log, or the webapp log? Which OS, which Python version? Could you post your carbon.conf and local_settings.py online?
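As a rough illustration of that check, a throwaway script along these lines can catch a port/instance mismatch between the two files (the entries, the instance map, and the helper function are all hypothetical examples, not part of Graphite):

```python
# Hypothetical sanity check: every "ip:port:instance" entry in the
# webapp's CARBONLINK_HOSTS should correspond to a [cache:<instance>]
# section in carbon.conf whose CACHE_QUERY_PORT matches the port,
# otherwise the webapp queries the wrong place.

CARBONLINK_HOSTS = ["127.0.0.1:7201:2105", "127.0.0.1:7202:2107"]

# instance name -> CACHE_QUERY_PORT, as read out of carbon.conf
CARBON_CONF_INSTANCES = {"2105": 7201, "2107": 7202}

def check_carbonlink(hosts, instances):
    """Return the CARBONLINK_HOSTS entries that don't line up."""
    problems = []
    for entry in hosts:
        ip, port, instance = entry.split(":")
        if instances.get(instance) != int(port):
            problems.append(entry)
    return problems

print(check_carbonlink(CARBONLINK_HOSTS, CARBON_CONF_INSTANCES))  # -> []
```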

mwtzzz-zz commented 6 years ago

@piotr1212 I'm running a relay layer using carbon-c-relay with the fnv1a_ch hash. No exceptions in the logs. OS: Amazon Linux 2015.09. Python 2.7.

Here's how the carbonlink and cluster_servers destinations look in local_settings.py:

CARBONLINK_HOSTS = ["127.0.0.1:7201:2105","127.0.0.1:7202:2107","127.0.0.1:7203:2109","127.0.0.1:7204:2111","127.0.0.1:7205:2113","127.0.0.1:7206:2115"]

CLUSTER_SERVERS = ['radar-be-a.mgmt.yyy.com:8030','radar-be-b.mgmt.yyy.com:8030','radar-be-c.mgmt.yyy.com:8030','radar-be-d.mgmt.yyy.com:8030','radar-be-e.mgmt.yyy.com:8030', ... ]

Here's my carbon.conf. There are six cache sections ([cache:2105], [cache:2107], etc.) with CACHE_QUERY_PORT 7201, 7202, etc.:

[cache:2105]
ENABLE_LOGROTATION = True
USER = yyy
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 10000
MAX_CREATES_PER_MINUTE = 3000
USE_INSECURE_UNPICKLER = False
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CREATES = False
LOG_CACHE_QUEUE_SORTS = False
CACHE_WRITE_STRATEGY = timesorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = False
CARBON_METRIC_PREFIX = carbon.agents.radar123
CARBON_METRIC_INTERVAL = 60
ENABLE_AMQP = False
ENABLE_MANHOLE = False

ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 0
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2105
CACHE_QUERY_PORT = 7201

mwtzzz-zz commented 6 years ago

Here are some logs that might help:

From cache.log:

Mon Jun 11 11:08:09 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 0 datapoints
Mon Jun 11 11:08:12 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 11:08:12 2018 :: Data-Cache miss [e8da12dd07d2498aca62b1e42d1c829a]
Mon Jun 11 11:08:12 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 11:08:12 2018 :: Data-Cache miss [a6fadeacec0b735c4d577bb25e7ca05e]


mwtzzz-zz commented 6 years ago

@piotr1212 I found a clue.

I decided to test each cache instance one at a time. I edited local_settings.py and got rid of five entries in CARBONLINK_HOSTS, leaving only one. I iterated through these, and when I tested CARBONLINK_HOSTS = ["127.0.0.1:7205:2113"] the data points showed up for this metric, and cache.log showed that 25 data points were retrieved:

Mon Jun 11 20:35:56 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 20:35:56 2018 :: Data-Cache miss [52fd8f9bd86cdba7ec015a07be13d28a]
Mon Jun 11 20:35:56 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 20:35:56 2018 :: Data-Cache miss [9d471edfc61931ce1613fc64cd86df7b]
Mon Jun 11 20:35:57 2018 :: CarbonLink sending request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts to ('127.0.0.1', '2113')
Mon Jun 11 20:35:57 2018 :: CarbonLink finished receiving atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts from ('127.0.0.1', '2113')
Mon Jun 11 20:35:57 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 25 datapoints

These 25 data points correspond to the 25 minutes that have elapsed since the whisper file was last updated.

But if I put CARBONLINK_HOSTS back to its original setting with all six entries, it can no longer retrieve the data points.

A hashing problem?
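A hash-type mismatch would behave exactly like this: the relay and the webapp each pick a cache instance by hashing the metric name, and if they hash differently, their picks disagree. A minimal sketch of the idea, using plain modulo bucketing rather than carbon's actual consistent-hash ring:

```python
# Illustrative only: shows how two different hash functions route the
# same metric name to different instances. This is simple modulo
# bucketing, NOT carbon's real consistent-hash ring implementation.
import hashlib

INSTANCES = ["2105", "2107", "2109", "2111", "2113", "2115"]

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, the hash family behind carbon-c-relay's fnv1a_ch."""
    h = 0x811c9dc5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xffffffff
    return h

def md5_32(data: bytes) -> int:
    """First 32 bits of MD5, standing in for the default MD5-based hashing."""
    return int(hashlib.md5(data).hexdigest()[:8], 16)

metric = b"atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts"
relay_pick = INSTANCES[fnv1a_32(metric) % len(INSTANCES)]
webapp_pick = INSTANCES[md5_32(metric) % len(INSTANCES)]
print(relay_pick, webapp_pick)  # when these differ, the webapp queries a cache that never saw the metric
```

Metrics whose two hashes happen to land on the same instance still work, which would explain why only some metrics show recent data.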

mwtzzz-zz commented 6 years ago

@piotr1212 @deniszh It's querying the wrong cache instance. It's querying 2115 instead of 2113:

Mon Jun 11 20:56:20 2018 :: Request-Cache miss [8c49f1d46f13caab4a3409ea7eeff4dc]
Mon Jun 11 20:56:20 2018 :: Data-Cache miss [b610953d155b2e73634893bc1a6b5d89]
Mon Jun 11 20:56:20 2018 :: Request-Cache miss [7f8cee809c167513113e667baf88e08e]
Mon Jun 11 20:56:20 2018 :: Data-Cache miss [0ba2e41252e01e4fab74db41b4f8b741]
Mon Jun 11 20:56:20 2018 :: CarbonLink sending request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts to ('127.0.0.1', '2115')
Mon Jun 11 20:56:20 2018 :: CarbonLink finished receiving atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts from ('127.0.0.1', '2115')
Mon Jun 11 20:56:20 2018 :: CarbonLink cache-query request for atlantic_exchange.imp.demand.neptune_direct.sum_all.hosts returned 0 datapoints

My relay writes to these cache instances like this:


cluster cache
  fnv1a_ch
    127.0.0.1:2105=2105
    127.0.0.1:2107=2107
    127.0.0.1:2109=2109
    127.0.0.1:2111=2111
    127.0.0.1:2113=2113
    127.0.0.1:2115=2115
  ;

mwtzzz-zz commented 6 years ago

I fixed it by updating graphite-web, adding CARBONLINK_HASHING_TYPE = 'fnv1a_ch' to local_settings.py, and installing scandir via pip, as per https://github.com/graphite-project/graphite-web/issues/1867

Apparently this problem had been with us for quite some time, but we didn't realize it until we moved from fast NVMe drives to slower EBS volumes.
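For anyone hitting the same thing, the key line of the fix in local_settings.py (assuming, as above, a relay tier hashing with fnv1a_ch) is:

```python
# local_settings.py: make graphite-web hash metric names the same way
# the relay does when picking which carbon-cache instance to query.
CARBONLINK_HASHING_TYPE = 'fnv1a_ch'
```

Without this, the webapp falls back to its default hashing, which does not match a carbon-c-relay cluster configured with fnv1a_ch.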

deniszh commented 6 years ago

Cool, good to know that's solved now! Sorry, I was busy and not able to support you, unfortunately.