graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Carbon Cache to 0 #678

Closed ttftw closed 5 years ago

ttftw commented 7 years ago

[screenshot]

[screenshot]

Earlier today I noticed this happen, and a little later someone's metrics started to hang: a gauge would update, then on the next tick revert to its old value. I restarted Graphite and that fixed the issue, but a few hours later the cache went to 0 again. It's still showing cache query metrics, and the other metrics are not stalling like before.

Is this typical?

ttftw commented 7 years ago

Some stats die, and when I restart, they start right back up.

[screenshot]

deniszh commented 7 years ago

Do you have anything in the logs? It's not really clear what the symptoms of the problem are.

ttftw commented 7 years ago

I've tried looking through them and didn't see anything that stuck out.

Where would I start to diagnose? Twice now I've had metrics freeze, where I've had to restart before things start flowing in again. It's like the cache is freezing or something, but I'm just not sure where to check.

piotr1212 commented 7 years ago

You could check CPU usage, disk I/O, open file descriptors, memory usage, cache size, OS logs, etc., or attach strace to the process (or gdb with the Python extensions) to get a backtrace. Which Python version are you running? I've experienced issues with garbage collection freezing carbon on Python 2.6, but that was with a huge cache size (>300,000 unique metrics with >100 points per metric in one carbon-cache).
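Nothing carbon-specific, but as a minimal sketch of watching those numbers over time (assuming psutil is installed and you know the carbon-cache PID; the 5-second interval is arbitrary):

```python
import sys
import time

import psutil  # third-party: pip install psutil

pid = int(sys.argv[1])              # PID of the carbon-cache instance to watch
proc = psutil.Process(pid)

while True:
    with proc.oneshot():            # batch the /proc reads for one snapshot
        cpu = proc.cpu_percent(interval=None)        # % CPU since the previous call
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        fds = proc.num_fds()        # open file descriptors (Unix only)
    print("cpu={:.1f}% rss={:.1f}MiB fds={}".format(cpu, rss_mb, fds))
    time.sleep(5)
```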

ttftw commented 7 years ago

The cache size is what I'm saying is wrong: it dropped to zero before the metrics started to freeze. They seemed to keep coming in for a bit, but then they hang. The gauge goes up as if it received an update, but on the next tick it falls back to the original value, and it stays down until the next event arrives, when it goes up and then back down again on the following tick. I can attach a screenshot here in a bit. It looks like a flat line with peaks when this gauge should be an increasing slope.

I've looked through what logs I can find and haven't noticed anything. I also have metrics for the server from netdata and nothing stands out. All the I/O, memory, CPU, etc. seem typical.

I'll check the Python version when I get back, but there are only about 9k updates happening at a time.

Sorry I'm a little short, I'm on mobile. :)

piotr1212 commented 7 years ago

I have no clue. Maybe there's some issue with actually receiving the metrics over the network; you could check open sockets, network stats, tcpdump, etc. to verify...
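For example, a rough sketch of counting established client connections on the line receiver (assumes psutil, the default LINE_RECEIVER_PORT of 2003, and possibly root privileges to see other users' sockets):

```python
import psutil  # third-party: pip install psutil

LINE_RECEIVER_PORT = 2003  # carbon's default plaintext port; adjust to your carbon.conf

established = [
    c for c in psutil.net_connections(kind="tcp")
    if c.laddr and c.laddr[1] == LINE_RECEIVER_PORT
    and c.status == psutil.CONN_ESTABLISHED
]
print("{} established connections on port {}".format(
    len(established), LINE_RECEIVER_PORT))
```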

piotr1212 commented 7 years ago

PS: it might help if you specify which Graphite version, Python version, OS type/version, number of metrics, type of setup (number of caches/relays), etc. you are running.

ttftw commented 7 years ago

I'll do some more digging around; I just don't know what might cause the cache to drop to zero. It seems like it's related. Not sure if that is normal behavior when we have 8-9k events coming in.

[screenshot]

piotr1212 commented 7 years ago

In the second image the number of metrics received goes down. If the number of metrics received is less than what the system can handle (and lower than MAX_UPDATES allows), the cache will drain to zero and points per update drops to 1. This seems perfectly fine.
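To illustrate the drain, here is a toy sketch (not carbon code; the actual carbon.conf setting is MAX_UPDATES_PER_SECOND, and the numbers below are made up):

```python
# Toy model: when the flush rate exceeds the incoming rate, the backlog
# drains until every update writes a single point and the cache sits at 0.
incoming_per_sec = 6000   # metricsReceived after the drop
flush_per_sec = 9000      # what disk speed / MAX_UPDATES_PER_SECOND allow
cache = 20000             # datapoints currently queued in carbon-cache

for second in range(10):
    cache += incoming_per_sec
    flushed = min(cache, flush_per_sec)
    cache -= flushed
    print("t={}s cache={}".format(second, cache))
# cache reaches 0 and stays there; pointsPerUpdate trends toward 1.
```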

The question for you would be: why does the number of metrics received drop from 9k to 6k?

If you still need help, please give some more information about your setup, e.g. number of caches, relays, Python version, OS version, Graphite version, how the metrics are sent, whether you use statsd or not, etc.