graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

[BUG] carbon-cache always returns 0 data points for cached metric, but writes the metric to disk regularly #940

Open percygrunwald opened 2 years ago

percygrunwald commented 2 years ago

Describe the bug

We have a graphite cluster with approximately 130 million metrics spread across 3 storage nodes running graphite-web and 16 instances of carbon-cache. For a small proportion of our metrics (I'm unsure of the exact percentage, but from my experience it seems like <1%), carbon-cache always returns 0 data points. We know for a fact that carbon-cache is receiving the metrics in question, though, because the whisper file is regularly updated. The way this problem manifests is that for certain metrics, the latest data point can be quite far away from "now", periodically "catching up" when carbon-cache flushes its data points to the whisper file.

I have verified with tcpdump that carbon-c-relay is sending the metrics to the same carbon-cache instance that graphite-web tries to query for the cached data points.

I feel like this might have something to do with hashing, since the same metric name exhibits the same problem in both clusters that receive the metric independently, and metrics one character off (e.g. the same metric from host124 instead of host123) do not exhibit the problem. We have at least 3 instances of the same metric on different hosts exhibiting the problem (out of around 500 hosts reporting the metric), which gives me further cause to believe the issue lies in carbon itself rather than somewhere else.
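
For context, instance selection on both the write path (carbon-c-relay) and the read path (graphite-web's carbonlink) is typically done with an md5-based consistent hash ring. Below is a minimal, self-contained sketch of the idea; the replica count and ring construction are illustrative assumptions rather than carbon's exact implementation, but it shows why a given metric name always maps to the same instance while a name one character off can map elsewhere.

import bisect
from hashlib import md5

def ring_position(key):
    # Position on the ring: first two bytes of the md5 digest of the key.
    return int(md5(key.encode("utf-8")).hexdigest()[:4], 16)

class SimpleHashRing:
    """Toy consistent hash ring mapping a metric name to a cache instance."""

    def __init__(self, nodes, replicas=100):
        entries = sorted(
            (ring_position("%s:%d" % (node, i)), node)
            for node in nodes
            for i in range(replicas)
        )
        self._positions = [pos for pos, _ in entries]
        self._nodes = [node for _, node in entries]

    def get_node(self, metric):
        # First ring entry at or after the metric's position, wrapping around.
        i = bisect.bisect_left(self._positions, ring_position(metric)) % len(self._nodes)
        return self._nodes[i]

ring = SimpleHashRing(["cache:%d" % i for i in range(16)])
# The same name always maps to the same instance; a name one character off may land elsewhere.
print(ring.get_node("statsd.host123-primary.some_metric.count"))
print(ring.get_node("statsd.host124-primary.some_metric.count"))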

The logs we're seeing:

76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints

Metric exhibiting problem:

(screenshot)

Metric of same name from other host (path differing by one character):

(screenshot)

Note that I have changed the exact metric name to not expose our metric publicly, but I'm happy to provide the metric name privately for debugging (assuming the issue may be related to the exact metric name).

To Reproduce

We are able to replicate the behavior by querying a metric of the same name for different hosts, e.g. statsd.host123-primary.some_metric.count. The metric returns data, but is lagging far behind. The graphite-web cache logs show that the metric always returns 0 data points. Querying for the same metric where the path differs by one character, e.g. statsd.host124-primary.some_metric.count or statsd.host125-primary.some_metric.count, does not exhibit the problem.

Expected behavior

carbon-cache should return data points for the metrics in question, given that it appears to have the data in memory (evidenced by the fact that it writes data to disk at regular intervals).

Environment (please complete the following information):

deniszh commented 2 years ago

Hi @percygrunwald

Looks like a bug indeed, but I'm not sure how it can be investigated externally. :( Could you please share your configs to start with (for carbon and graphite-web)?

percygrunwald commented 2 years ago

Thanks for your reply; here are both configurations.

A representative carbon-cache carbon.conf:

...
[cache:9]
STORAGE_DIR = /mnt/graphite/storage/
CONF_DIR = /opt/graphite/conf/
LOG_DIR = /var/log/carbon/
LOCAL_DATA_DIR = /mnt/ssd/graphite/storage/whisper/
USER = graphite
MAX_CACHE_SIZE = 200000000
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = inf
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2436
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 127.0.0.1
UDP_RECEIVER_PORT = 2437
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2438
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 127.0.0.1
CACHE_QUERY_PORT = 2439
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
ENABLE_TAGS = False
...

graphite-web config:

SECRET_KEY = '...'
TIME_ZONE = 'America/Los_Angeles'
LOG_ROTATION = True
LOG_RENDERING_PERFORMANCE = True
LOG_CACHE_PERFORMANCE = True
import os
_pid = os.getpid()
LOG_FILE_INFO = "info.%d.log" % _pid
LOG_FILE_EXCEPTION = "exception.%d.log" % _pid
LOG_FILE_CACHE = "cache.%d.log" % _pid
LOG_FILE_RENDERING = "rendering.%d.log" % _pid
DEFAULT_XFILES_FACTOR = 0
FIND_TIMEOUT = 60.0  # Timeout for metric find requests
FETCH_TIMEOUT = 60.0  # Timeout to fetch series data
CONF_DIR = '/opt/graphite/conf'
STORAGE_DIR = '/mnt/ssd/graphite/storage'
STATIC_ROOT = '/opt/graphite/static'
LOG_DIR = '/opt/graphite/storage/log/webapp'
INDEX_FILE = '/opt/graphite/storage/index'     # Search index file
WHISPER_DIR = '/mnt/ssd/graphite/storage/whisper'
DATABASES = {
    'default': {
        'NAME': '/opt/graphite/storage/graphite.db',
        'ENGINE': 'django.db.backends.sqlite3',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': ''
    }
}
INTRACLUSTER_HTTPS = True
CARBONLINK_HOSTS = [
    '127.0.0.1:2403:0',
    '127.0.0.1:2407:1',
    '127.0.0.1:2411:2',
    '127.0.0.1:2415:3',
    '127.0.0.1:2419:4',
    '127.0.0.1:2423:5',
    '127.0.0.1:2427:6',
    '127.0.0.1:2431:7',
    '127.0.0.1:2435:8',
    '127.0.0.1:2439:9',
    '127.0.0.1:2443:10',
    '127.0.0.1:2447:11',
    '127.0.0.1:2451:12',
    '127.0.0.1:2455:13',
    '127.0.0.1:2459:14',
    '127.0.0.1:2463:15',
]
CARBONLINK_TIMEOUT = 60.0
FUNCTION_PLUGINS = []

I understand this would be hard to diagnose remotely. I'm happy to run any tests that would help. Thank you.

deniszh commented 2 years ago

@percygrunwald : yes, the config looks perfectly fine, indeed. I will think about how best to debug this. Maybe we have some hashing issue and graphite-web is checking a different carbonlink instead of the proper one for some metrics... BTW, instead of multiple carbon caches you could consider using go-carbon, but please note that it has no aggregator.

percygrunwald commented 2 years ago

@deniszh thank you for your feedback.

Maybe we have some hashing issue and graphite-web is checking a different carbonlink instead of the proper one for some metrics...

In my testing, tcpdump output seems to show that graphite-web is reading from the same carbon-cache instance that carbon-c-relay is writing to.

15:23:29.925423 IP localhost.43704 > localhost.2439: Flags [P.], seq 270706700:270706830, ack 2015752604, win 64, options [nop,nop,TS val 1165724566 ecr 1165596638], length 130
E...x.@.@.............  .."..x%.....@.......
E{..Ey.....~..}q.(U.metricq.XN...statsd.host123-primary.some_metric.countq.U.typeq.U.cache-queryq.u.

This is only the write side, but it's the correct port for carbon-cache:9. On the read side, it appears as though graphite-web is querying carbon-cache:9 as well:

76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints

Unless the log is wrong, it appears as though the write and read are happening to the same carbon-cache instance (9).
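
A quick way to sanity-check what graphite-web puts on the wire is to reconstruct the cache-query payload and compare it against the capture. This is only a sketch: the exact bytes depend on the Python and pickle protocol versions in use, but the structure (a 4-byte big-endian length prefix followed by a pickled {'metric': ..., 'type': 'cache-query'} dict) should match what the ASCII dump above shows.

import pickle
import struct

# Reconstruct a carbonlink cache-query frame: 4-byte length prefix + pickled dict.
payload = pickle.dumps(
    {"metric": "statsd.host123-primary.some_metric.count", "type": "cache-query"},
    protocol=2,
)
frame = struct.pack("!L", len(payload)) + payload
print(len(frame))
print(frame)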

BTW, instead of multiple carbon caches you can consider to use go-carbon instead

Thank you for the suggestion. We make use of go-carbon in our new architecture and it has worked very well as a carbon replacement.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

deniszh commented 1 year ago

Looks like the bug really exists, but I'm still very confused by it.

itinneed2022 commented 1 year ago

Hello Deniszh, I think this is the same issue I'm experiencing on my setup. Is there any way to check the carbon daemons directly to see what data they're currently holding?

deniszh commented 1 year ago

Sorry, could you please elaborate on what you mean?

itinneed2022 commented 1 year ago

@deniszh I'm seeing the same error @percygrunwald is seeing.

2023-07-19,14:45:16.451 :: CarbonLink finished receiving test.density-full.47_Clare.47_C.fullness from ('127.0.0.1', 'b')
2023-07-19,14:45:16.451 :: Carbonlink results: {'datapoints': []}

I created a question on graphite-web last year. In my cache logs I've noticed some metrics do return datapoints, others do not.

Is there any way to query the carbon-cache daemons directly and see if the metrics exist in the caches?
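
For reference, carbon-cache answers these queries on its CACHE_QUERY_PORT using a length-prefixed pickle protocol (the same framing graphite-web's carbonlink client uses), so an instance can be queried directly with a short script. This is only a minimal sketch, not an official tool; the port (2439, the cache:9 CACHE_QUERY_PORT from the config earlier in the thread) and the metric name are examples and need adjusting per setup.

import pickle
import socket
import struct

def recv_exactly(conn, n):
    # Read exactly n bytes from the socket or fail.
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("connection closed while reading response")
        data += chunk
    return data

def cache_query(host, port, metric):
    # Send a carbonlink-style cache-query (4-byte length prefix + pickled dict)
    # and return the unpickled response, e.g. {'datapoints': [(ts, value), ...]}.
    request = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=2)
    with socket.create_connection((host, port)) as conn:
        conn.sendall(struct.pack("!L", len(request)) + request)
        length = struct.unpack("!L", recv_exactly(conn, 4))[0]
        return pickle.loads(recv_exactly(conn, length))

print(cache_query("127.0.0.1", 2439, "statsd.host123-primary.some_metric.count"))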

itinneed2022 commented 1 year ago

I added some lines to my carbonlink.py file to try and narrow down the issue. Here's a snippet:

2023-07-20,13:46:14.823 :: Conn is: <socket.socket fd=30, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 54484), raddr=('127.0.0.1', 7002)>
2023-07-20,13:46:14.823 :: Metric requested is: test.density-full.Broad_Flat.Broad_Flat_14.client_count
2023-07-20,13:46:14.823 :: CarbonLink sending request for test.density-full.Broad_Flat.Broad_Flat_14.client_count to ('127.0.0.1', 'a')
2023-07-20,13:46:14.823 :: len_prefix in recv_response is: b'\x00\x00\x00-'
2023-07-20,13:46:14.823 :: body size in recv_response is: 45
2023-07-20,13:46:14.824 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.'
2023-07-20,13:46:14.824 :: len_prefix in recv_response is: b'\x00\x00\x00-'
2023-07-20,13:46:14.824 :: body size in recv_response is: 45
2023-07-20,13:46:14.824 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.'
2023-07-20,13:46:14.824 :: Made it inside the try, result is: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: result2 is: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: CarbonLink finished receiving test.density-full.Broad_Flat.Broad_Flat_14.client_count from ('127.0.0.1', 'a')
2023-07-20,13:46:14.824 :: Carbonlink results: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: CarbonLink cache-query request for test.density-full.Broad_Flat.Broad_Flat_14.client_count returned 1 datapoints
2023-07-20,13:47:43.056 :: Request-Cache miss [5df10b47fc6a60a71d25ff3c8bae38b4]
2023-07-20,13:47:43.056 :: Data-Cache miss [d15e419cbcaa3638f54205c85c887466]
2023-07-20,13:47:49.075 :: Request-Cache miss [6fa36018ab95e559b1c4deb71d1f6dae]
2023-07-20,13:47:49.076 :: Data-Cache miss [f523d7ea713e59b4d97153c6f1a6659e]
2023-07-20,13:47:49.094 :: Host is: ('127.0.0.1', 'a')
2023-07-20,13:47:49.094 :: CarbonLink creating a new socket for ('127.0.0.1', 'a')
2023-07-20,13:47:49.095 :: Carbonlink connection successful
2023-07-20,13:47:49.095 :: Conn is: <socket.socket fd=36, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 55548), raddr=('127.0.0.1', 7002)>
2023-07-20,13:47:49.095 :: Metric requested is: CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle
2023-07-20,13:47:49.095 :: CarbonLink sending request for CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle to ('127.0.0.1', 'a')
2023-07-20,13:47:49.097 :: len_prefix in recv_response is: b'\x00\x00\x00\x1b'
2023-07-20,13:47:49.097 :: body size in recv_response is: 27
2023-07-20,13:47:49.098 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'
2023-07-20,13:47:49.098 :: len_prefix in recv_response is: b'\x00\x00\x00\x1b'
2023-07-20,13:47:49.098 :: body size in recv_response is: 27
2023-07-20,13:47:49.098 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'
2023-07-20,13:47:49.098 :: Made it inside the try, result is: {'datapoints': []}
2023-07-20,13:47:49.098 :: result2 is: {'datapoints': []}
2023-07-20,13:47:49.098 :: CarbonLink finished receiving CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle from ('127.0.0.1', 'a')
2023-07-20,13:47:49.098 :: Carbonlink results: {'datapoints': []}
2023-07-20,13:47:49.098 :: CarbonLink cache-query request for CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle returned 0 datapoints
2023-07-20,13:47:49.969 :: Request-Cache miss [5df10b47fc6a60a71d25ff3c8bae38b4]
2023-07-20,13:47:49.969 :: Data-Cache miss [d15e419cbcaa3638f54205c85c887466]
2023-07-20,13:47:50.412 :: Request-Cache miss [d6f5aae245b305a127886bf5de73f3ce]
2023-07-20,13:47:50.412 :: Data-Cache miss [fe44d544439db17608cd6d177e3bbf8e]

Something I'm consistently noticing in the logs is that any query that returns empty has a byte string that ends in: datapointsq\x01]q\x02s.

Whereas successfully retrieved bytes always end in: datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as. I don't know how to decode the byte strings, but I'm wondering whether they are not properly formed.
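
For what it's worth, those byte strings are just the pickled response dicts from the carbonlink protocol, and both variants decode cleanly, so they appear to be well-formed; the shorter one simply carries an empty datapoints list. A quick check in plain Python, using the two payloads quoted above:

import pickle

with_data = b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.'
empty = b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'

print(pickle.loads(with_data))  # {'datapoints': [(1689874804, 0.0)]}
print(pickle.loads(empty))      # {'datapoints': []}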