graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

Carbon cache metrics lookup incorrect #1520

Closed: charlesdunbar closed this issue 4 years ago

charlesdunbar commented 8 years ago

Using 0.9.15, but the issue may exist on master as well.

When running multiple carbon-cache instances on a machine, the carbon-cache's own metrics may not be queried from memory; only what's already on disk is returned. This leaves blank points on the graph until carbon-cache flushes its own metrics to disk.

An example I'm running into is trying to find the metric "carbon.agents.graphite-be2-prod-b.cache.size". From what I can understand from the carbon-cache code and using manhole, these carbon metrics live in the MetricCache of the specific cache, in this case cache:b.

ssh root@127.0.0.1 -p 7223 # manhole port of cache:b
>>> from carbon.cache import MetricCache
>>> MetricCache['carbon.agents.graphite-be2-prod-b.cache.size']
    deque([(1463625857.071111, 552006), (1463625917.071114, 553358), (1463625977.071121, 556979), (1463626037.071193, 558467), (1463626097.071138, 560311), (1463626157.071113, 559504), (1463626217.071104, 562210)])

The issue is those metrics don't appear when accessing the data via the web interface. JSON output of the relevant timeframe:

[{"target": "carbon.agents.graphite-be2-prod-b.cache.size", "datapoints":
...
...
[null, 1463625840], [null, 1463625900], [null, 1463625960], [null, 1463626020], [null, 1463626080], [null, 1463626140], [null, 1463626200], [null, 1463626260]]}]

I've configured CARBONLINK_HOSTS to include every cache_query_port of my local machine, 16 instances in this case.

CARBONLINK_HOSTS = ['127.0.0.1:7002:a','127.0.0.1:7102:b','127.0.0.1:7202:c','127.0.0.1:7302:d','127.0.0.1:7402:e','127.0.0.1:7502:f','127.0.0.1:7602:g','127.0.0.1:7702:h','127.0.0.1:7802:i','127.0.0.1:7902:j','127.0.0.1:8002:k','127.0.0.1:8102:l','127.0.0.1:8202:m','127.0.0.1:8302:n','127.0.0.1:8402:o','127.0.0.1:8502:p']

Doing a tcpdump of the localhost interface and all of those ports only shows 7702 (cache:h) being accessed when performing the query. Since cache:h doesn't actually have any carbon-cache metrics for cache:b, I assume that's why I'm seeing nulls.

# tcpdump -nnvvXSs 1514 -i lo port 7002 or port 7102 or port 7202 or port 7302 or port 7402 or port 7502 or port 7602 or port 7702 or port 7802 or port 7902 or port 8002 or port 8102 or port 8202 or port 8302 or port 8402 or port 8502
    tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 1514 bytes
    20:01:05.612174 IP (tos 0x0, ttl 64, id 33358, offset 0, flags [DF], proto TCP (6), length 155)
        127.0.0.1.51673 > 127.0.0.1.7702: Flags [P.], cksum 0xfe8f (incorrect -> 0x0461), seq 447742954:447743057, ack 2144911844, win 367, options [nop,nop,TS val 1039169642 ecr 1039136912], length 103
            0x0000:  4500 009b 824e 4000 4006 ba0c 7f00 0001  E....N@.@.......
            0x0010:  7f00 0001 c9d9 1e16 1ab0 03ea 7fd8 c1e4  ................
            0x0020:  8018 016f fe8f 0000 0101 080a 3df0 786a  ...o........=.xj
            0x0030:  3def f890 0000 0063 8002 7d71 0128 5507  =......c..}q.(U.
            0x0040:  6d65 7472 6963 7371 025d 7103 552c 6361  metricsq.]q.U,ca
            0x0050:  7262 6f6e 2e61 6765 6e74 732e 6772 6170  rbon.agents.grap
            0x0060:  6869 7465 2d62 6532 2d70 726f 642d 622e  hite-be2-prod-b.
            0x0070:  6361 6368 652e 7369 7a65 7104 6155 0474  cache.sizeq.aU.t
            0x0080:  7970 6571 0555 1063 6163 6865 2d71 7565  ypeq.U.cache-que
            0x0090:  7279 2d62 756c 6b71 0675 2e              ry-bulkq.u.
    20:01:05.612848 IP (tos 0x0, ttl 64, id 16379, offset 0, flags [DF], proto TCP (6), length 140)
        127.0.0.1.7702 > 127.0.0.1.51673: Flags [P.], cksum 0xfe80 (incorrect -> 0x62c5), seq 2144911844:2144911932, ack 447743057, win 359, options [nop,nop,TS val 1039169642 ecr 1039169642], length 88
            0x0000:  4500 008c 3ffb 4000 4006 fc6e 7f00 0001  E...?.@.@..n....
            0x0010:  7f00 0001 1e16 c9d9 7fd8 c1e4 1ab0 0451  ...............Q
            0x0020:  8018 0167 fe80 0000 0101 080a 3df0 786a  ...g........=.xj
            0x0030:  3df0 786a 0000 0054 8002 7d71 0155 1264  =.xj...T..}q.U.d
            0x0040:  6174 6170 6f69 6e74 7342 794d 6574 7269  atapointsByMetri
            0x0050:  6371 027d 7103 552c 6361 7262 6f6e 2e61  cq.}q.U,carbon.a
            0x0060:  6765 6e74 732e 6772 6170 6869 7465 2d62  gents.graphite-b
            0x0070:  6532 2d70 726f 642d 622e 6361 6368 652e  e2-prod-b.cache.
            0x0080:  7369 7a65 7104 5d71 0573 732e            sizeq.]q.ss.
    20:01:05.612870 IP (tos 0x0, ttl 64, id 33359, offset 0, flags [DF], proto TCP (6), length 52)
        127.0.0.1.51673 > 127.0.0.1.7702: Flags [.], cksum 0xfe28 (incorrect -> 0xc190), seq 447743057, ack 2144911932, win 367, options [nop,nop,TS val 1039169642 ecr 1039169642], length 0
            0x0000:  4500 0034 824f 4000 4006 ba72 7f00 0001  E..4.O@.@..r....
            0x0010:  7f00 0001 c9d9 1e16 1ab0 0451 7fd8 c23c  ...........Q...<
            0x0020:  8010 016f fe28 0000 0101 080a 3df0 786a  ...o.(......=.xj
            0x0030:  3df0 786a                                =.xj

Using carbonate and carbon-lookup, I see that the hash ring is expecting that metric to exist on cache:h, which is why I see it being accessed via a tcpdump.

# carbon-lookup -C carbonlink carbon.agents.graphite-be2-prod-b.cache.size
127.0.0.1:7702:h

The issue appears to be that https://github.com/graphite-project/graphite-web/blob/0.9.15/webapp/graphite/render/datalib.py#L116-L118 is used to determine which cache to query. That works as expected for every metric except the carbon-cache metrics themselves, which are special and not routed like every other metric.
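For illustration, here is a minimal, self-contained sketch of consistent-hash selection in the style CarbonLink uses (this is not the actual CarbonLinkPool code; the instance names, replica count, and hashing details are assumptions). The point is that the ring owner of a metric name has nothing to do with which cache instance actually produced the datapoints, so cache:b's self-metrics can end up being looked up on a different instance such as cache:h.

import bisect
from hashlib import md5

def ring_position(key):
    # Hash a string to an integer position on the ring.
    return int(md5(key.encode('utf-8')).hexdigest(), 16)

class SimpleHashRing(object):
    def __init__(self, instances, replicas=100):
        self.positions = []   # sorted ring positions
        self.owner = {}       # position -> instance name
        for instance in instances:
            for i in range(replicas):
                pos = ring_position('%s:%d' % (instance, i))
                self.owner[pos] = instance
                bisect.insort(self.positions, pos)

    def get_instance(self, metric):
        pos = ring_position(metric)
        idx = bisect.bisect_left(self.positions, pos) % len(self.positions)
        return self.owner[self.positions[idx]]

ring = SimpleHashRing(list('abcdefghijklmnop'))
# The ring decides the owner purely from the metric name, regardless of
# which instance holds the datapoints in memory.
print(ring.get_instance('carbon.agents.graphite-be2-prod-b.cache.size'))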

I don't use carbon-relay or carbon-aggregator, but it looks to be only a carbon-cache issue. lib/carbon/instrumentation.py calls cache.MetricCache.store(fullMetric, datapoint) for a carbon-cache metric, while relay and aggregator use events.metricGenerated(fullMetric, datapoint).
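To make the difference concrete, here is a rough paraphrase of those two publication paths (not the verbatim carbon source; the wrapper function and daemon_type argument are hypothetical):

from carbon import events
from carbon.cache import MetricCache

def publish_self_metric(fullMetric, datapoint, daemon_type):
    # Hypothetical wrapper; in carbon these calls live in instrumentation.py.
    if daemon_type == 'cache':
        # carbon-cache writes its own metrics straight into this instance's
        # in-memory cache, so they never pass through metric routing.
        MetricCache.store(fullMetric, datapoint)
    else:
        # carbon-relay / carbon-aggregator emit a normal metric event,
        # which gets routed like any other datapoint.
        events.metricGenerated(fullMetric, datapoint)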

I'm not sure whether the correct fix is to query the specific instance for carbon-cache metrics in datalib.py, or whether carbon-cache metrics in instrumentation.py should also use events.metricGenerated so the metric gets routed.

cbowman0 commented 8 years ago

I know this exact problem. This is the patch on master that deals with it by querying all caches: https://github.com/graphite-project/graphite-web/commit/48bbfbe073df7852625b9462907ac56f9d65a297
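For anyone who can't apply the patch immediately, the idea behind it can be sketched like this (a conceptual stand-in, not graphite-web's actual CarbonLink API; the helper functions and sample data are made up for illustration):

def query_instance(instance, metric, caches):
    # Stand-in for a CarbonLink cache-query against one cache instance.
    return caches.get(instance, {}).get(metric, [])

def query_for_render(metric, instances, caches):
    if metric.startswith('carbon.'):
        # Carbon's self-reported metrics only exist in the instance that
        # wrote them, so ask every instance and merge the results.
        datapoints = []
        for instance in instances:
            datapoints.extend(query_instance(instance, metric, caches))
        return sorted(datapoints)
    # Normal metrics are routed, so only the hash-ring owner would be asked
    # (omitted in this sketch).
    raise NotImplementedError

# Only cache "b" holds its own cache.size datapoints in memory.
caches = {'b': {'carbon.agents.graphite-be2-prod-b.cache.size':
                [(1463625857, 552006), (1463625917, 553358)]}}
print(query_for_render('carbon.agents.graphite-be2-prod-b.cache.size',
                       list('abcdefghijklmnop'), caches))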

deniszh commented 8 years ago

Please note that this applies only to carbon.* metrics - they're processed in a special way.

charlesdunbar commented 8 years ago

@cbowman0 - Thanks for the quick response! I'll look into applying that patch.

@deniszh - I think it's only the carbon.agents.* metrics - happy to rename the issue for clarity.

charlesdunbar commented 8 years ago

Follow-up question: is there any place to track when/if master gets released as a version? I just noticed how long ago that patch was committed. Is master what 0.10 is going to be, or is it always just bleeding edge?

deniszh commented 8 years ago

Until now, master has never been released; it's always been bleeding edge. All releases have been cut from the 0.9.x branch, but the next major release will be 1.0 from the master branch. It's still not clear when, though.

deniszh commented 7 years ago

Hello @charlesdunbar, we've now tagged 1.0.0-rc1 from master; please test it.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.