graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Carbon seems to be dropping metrics: new files are not created and some existing ones are missing data points. #857

Closed: ecsumed closed this issue 5 years ago

ecsumed commented 5 years ago

I'm load testing a new carbon setup running 1 relay to 3 carbons (all on different hosts). The load test runs 720k metrics every 150 seconds. Here are the graphs: https://imgur.com/Wn41iG2

Notice the discrepancy between relay metrics received and relay metrics sent. And the one time the relay did send the full amount of metrics, the carbons only received a fraction of them.

Also notice the files created. Eventually there should be a total of 720k files (90k hosts x 8 metrics), but they flatten out. After about 40k on each of the 3 carbon hosts, new files were created only rarely.

Here's my relay config:

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
PICKLE_RECEIVER_MAX_LENGTH = 1048576

RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1

DESTINATIONS = <carbon1>:2004:a, <carbon2>:2004:b, <carbon3>:2004:c

MAX_QUEUE_SIZE = 1000000
MAX_DATAPOINTS_PER_MESSAGE = 1000

QUEUE_LOW_WATERMARK_PCT = 0.8
TIME_TO_DEFER_SENDING = 0.0001

USE_FLOW_CONTROL = True

USE_RATIO_RESET = False
MIN_RESET_STAT_FLOW = 1000
MIN_RESET_RATIO = 0.9
MIN_RESET_INTERVAL = 121
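With RELAY_METHOD = consistent-hashing, the relay hashes each metric name onto a ring and routes it to exactly one of the three destinations. A minimal sketch of the idea (this is not carbon's actual ConsistentHashRing implementation; the hash function and virtual-node layout here are simplified assumptions):

```python
# Simplified consistent-hash routing sketch. Carbon's real ring differs in
# hash choice and node encoding; this only illustrates how 720k metric
# names would spread roughly evenly across 3 destinations.
import bisect
import hashlib

class SimpleHashRing:
    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on the ring for balance.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, metric):
        # The first ring point clockwise from the metric's hash owns it.
        idx = bisect.bisect(self.keys, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = SimpleHashRing(["carbon1", "carbon2", "carbon3"])
counts = {}
for host in range(1000):          # scaled-down stand-in for 90k hosts
    for m in range(8):
        node = ring.get_node(f"hosts.host{host}.metric{m}")
        counts[node] = counts.get(node, 0) + 1
```

Each destination should end up with roughly a third of the 8,000 sample metrics, which is why with REPLICATION_FACTOR = 1 each cache would eventually own about 240k of the 720k files.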

And my 3 carbons config (all on different hosts):

[cache]
LINE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
CACHE_QUERY_INTERFACE = 0.0.0.0

ENABLE_TAGS = False

[cache:a]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7002

MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 50000
USE_FLOW_CONTROL = True

LOG_UPDATES = False
LOG_CREATES = True
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False
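For scale, a rough back-of-envelope against these limits (assuming the metrics hash evenly across the 3 caches, as in the config above):

```python
# Back-of-envelope for the load test: 720k metrics every 150s over 3 caches.
total_metrics = 720_000
caches = 3
interval_s = 150

per_cache_files = total_metrics // caches      # 240,000 whisper files each
creates_per_min = 50_000                       # MAX_CREATES_PER_MINUTE
minutes_to_create_all = per_cache_files / creates_per_min   # ~4.8 minutes

updates_per_s = total_metrics / caches / interval_s  # ~1,600 datapoints/s/cache
```

So the configured create limit alone should allow all files to exist within about five minutes; if creation still flattens out around 40k per host, the datapoints are likely never reaching the caches.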

The carbon cache is set to inf because I do not want any points to drop, so I'm not sure what's happening. The only anomaly I found was the relay complaining that the destinations (carbons) are down, even though the carbons are running fine.

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/console.log <==
30/05/2019 09:22:15 :: <twisted.internet.tcp.Connector instance at 0x7fae634b95a8 disconnected IPv4Address(type='TCP', host='<CARBON-C-IP>', port=2004)> will retry in 2 seconds

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/clients.log <==
30/05/2019 09:22:15 :: CarbonClientFactory(<CARBON-C-IP>:2004:c)::clientConnectionLost (<CARBON-C-IP>:2004) Connection was closed cleanly.
30/05/2019 09:22:15 :: Destination is down: <CARBON-C-IP>:2004:c (1/5)

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/console.log <==
30/05/2019 09:22:15 :: Stopping factory CarbonClientFactory(<CARBON-C-IP>:2004:c)
30/05/2019 09:22:18 :: Starting factory CarbonClientFactory(<CARBON-B-IP>:2004:b)

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/clients.log <==
30/05/2019 09:22:18 :: CarbonClientFactory(<CARBON-B-IP>:2004:b)::startedConnecting (<CARBON-B-IP>:2004)
30/05/2019 09:22:18 :: CarbonClientProtocol(<CARBON-B-IP>:2004:b)::connectionMade
30/05/2019 09:22:18 :: CarbonClientFactory(<CARBON-B-IP>:2004:b)::connectionMade (CarbonClientProtocol(<CARBON-B-IP>:2004:b))
30/05/2019 09:22:18 :: Destination is up: <CARBON-B-IP>:2004:b

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/console.log <==
30/05/2019 09:22:18 :: Starting factory CarbonClientFactory(<CARBON-C-IP>:2004:c)

==> /opt/graphite/storage/log/carbon-relay/carbon-relay-a/clients.log <==
30/05/2019 09:22:18 :: CarbonClientFactory(<CARBON-C-IP>:2004:c)::startedConnecting (<CARBON-C-IP>:2004)
30/05/2019 09:22:18 :: CarbonClientProtocol(<CARBON-C-IP>:2004:c)::connectionMade
30/05/2019 09:22:18 :: CarbonClientFactory(<CARBON-C-IP>:2004:c)::connectionMade (CarbonClientProtocol(<CARBON-C-IP>:2004:c))
30/05/2019 09:22:18 :: Destination is up: <CARBON-C-IP>:2004:c

Version: 1.2.0

What am I missing?

piotr1212 commented 5 years ago

Aren't you maxing out the relay (CPU)? If so, check out https://github.com/grobian/carbon-c-relay which is much faster than the Python implementation and does multiprocessing. If you care about performance and you are building a new setup, you might as well start out with https://github.com/lomik/go-carbon
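One way to check whether a single relay process is CPU-bound is to sample its CPU time from /proc. This is a hedged, Linux-only sketch (finding the relay's pid is left out):

```python
# Hedged Linux-only sketch: sample a process's utime+stime from
# /proc/<pid>/stat twice to estimate how much of one core it is using.
# A single-threaded carbon-relay that is the bottleneck sits near 1.0.
import os
import time

def cpu_seconds(pid):
    # Fields 14 and 15 of /proc/<pid>/stat are utime and stime, in clock
    # ticks (assumes the process name contains no spaces).
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().split()
    return (int(fields[13]) + int(fields[14])) / os.sysconf("SC_CLK_TCK")

def cpu_fraction(pid, interval=1.0):
    # Fraction of one core used over `interval` seconds.
    before = cpu_seconds(pid)
    time.sleep(interval)
    return (cpu_seconds(pid) - before) / interval
```

If `cpu_fraction(relay_pid)` hovers near 1.0 while destinations flap as in the logs above, the relay is saturating its single core.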

ecsumed commented 5 years ago

Woah! @piotr1212, spot on. So the relay was causing the crashes, not the destinations. Would a bigger server help here? In the meantime I'll check out the c-relay. I am testing on a new setup, but only to find out how many disks I'll need for my production setup, which I need to shard; a single disk is no longer feasible and is maxing out IO. My goal is a zero carbon queue with support for 500k metrics per 10 minutes.

piotr1212 commented 5 years ago

This project (the original Graphite) is written entirely in Python. Because of the GIL, Python threads within a single process cannot execute in parallel, which practically means that one process can only use one core at a time. By a larger server you most likely mean one with more cores; that would make no difference, as the process cannot use the extra cores. Instead you would need a load balancer that balances over multiple relay processes.
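To illustrate the load-balancer approach, a hypothetical HAProxy fragment (hostnames and backend layout are assumptions; any TCP balancer works the same way):

```
# Hypothetical HAProxy sketch: spread plaintext carbon traffic (port 2003)
# across two relay processes on separate hosts.
frontend carbon_in
    bind *:2003
    mode tcp
    default_backend carbon_relays

backend carbon_relays
    mode tcp
    balance roundrobin
    server relay1 relay-host-1:2003 check
    server relay2 relay-host-2:2003 check
```

Since every relay configured with the same DESTINATIONS computes the same consistent-hash ring, it does not matter which relay a datapoint lands on; it will be forwarded to the same cache either way.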

Some parts of Graphite have been rewritten in programming languages that do not have Python's limitation wrt multiprocessing. Examples are carbon-c-relay, carbon-relay-ng, and go-carbon. If you are building a new system I would go for those instead of the original Python implementation. The only original part to keep would be graphite-web, as it still has no full-featured replacement (that I am aware of).

ecsumed commented 5 years ago

Hey @piotr1212, thanks for suggesting the relay and carbon variants. I ended up using c-relay with go-carbon and so far it's great. Thanks!