grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Internal metrics not sent until restarted #452

Closed mvaldes14 closed 1 year ago

mvaldes14 commented 1 year ago

Got an autoscaling group running several relays, and they work pretty well until I have to scale them up. Scaling spins up new instances, which listen and send metrics to the cache nodes (confirmed with tcpdump), but none of the internal metrics are sent to the cache nodes, even though they are configured to do so.

  submit every 60 seconds
  reset counters after interval
  prefix with monitoring.carbon-relay.ip-10-218-15-38.relay
;

So I have to forcefully restart the process, and a couple of seconds later the metrics start to come in. As I understand it, that forces the relays to drop whatever metrics they have queued, which is not ideal. I'm not sure what I can provide to demonstrate this, as the instances come up from an AMI with carbon-c-relay baked in and running.

This issue happens every single time we scale relays.

grobian commented 1 year ago

I'm assuming you "scale" by sending a HUP to the process so it re-reads its config?

mvaldes14 commented 1 year ago

Right, so part of the boot process sends a reload (HUP), but it seems that's not sufficient, as the reported metric is still the one the AMI had. We end up restarting each new instance.

So I guess: is there any way for the process to detect that the self-reporting IP has changed and then restart itself entirely? Because the HUP doesn't seem to be enough.
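
For anyone hitting the same thing, the check asked about here could be done outside the relay, in a boot hook. The following is only a sketch under assumptions: `configured_prefix` and `prefix_is_stale` are hypothetical helper names, not part of carbon-c-relay, and the check only compares the `prefix with` line in a config file against the host's short name. The thread does not establish whether that is what the running process actually reports.

```python
import re

# Hypothetical helpers for a boot hook; none of this ships with carbon-c-relay.
# Matches a line such as "  prefix with monitoring.carbon-relay.<host>.relay".
PREFIX_RE = re.compile(r"^\s*prefix\s+with\s+(\S+)\s*$", re.MULTILINE)


def configured_prefix(config_text):
    """Return the value of the first 'prefix with' line, or None if absent."""
    match = PREFIX_RE.search(config_text)
    return match.group(1) if match else None


def prefix_is_stale(config_text, hostname):
    """Return True when the configured prefix does not name this host.

    The comparison uses the short hostname (the part before the first dot)
    and looks for it as one dot-separated component of the prefix.
    """
    prefix = configured_prefix(config_text)
    if prefix is None:
        return False  # no explicit prefix configured, nothing to compare
    short_name = hostname.split(".")[0]
    return short_name not in prefix.split(".")
```

A boot hook could call `prefix_is_stale(open(path).read(), socket.gethostname())` and, when it returns true, rewrite the config and do a full restart instead of a HUP.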

grobian commented 1 year ago

I still don't really get what's happening. I'm not familiar with AMI and scaling. Does the prefix change when you scale?

I don't understand how HUP is involved in the scaling process. It seems you're starting new ones. You are confident you're not running in submission mode (-s)?

mvaldes14 commented 1 year ago

The configuration is static, as it's managed entirely by Chef, so we use the same configuration on each instance we bring up. So no submission mode, and the prefixes are pretty much the same across the board. From what I see, we need to add a step to our cookbooks that restarts the entire process at boot, so it pushes the internal metrics out to the carbon cache cluster.

The thing with AMIs is that, since everything is pre-baked, all processes inherit the configuration that was in place when the AMI was baked. So everything just starts up under that initial configuration.

This is most likely not an issue with the relay but with the way it's deployed on our end. Thanks for looking into it though!
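
One way to avoid inheriting a baked-in prefix is to render the statistics stanza from the instance's own hostname on first boot, before the relay is started. The sketch below is a guess at how that could look, not the setup used in this thread. `render_statistics` and `TEMPLATE` are hypothetical names, the prefix layout is copied from the fragment quoted above, and the opening `statistics` keyword is an assumption because that fragment starts at `submit every`.

```python
# Hypothetical template for a first-boot hook; not part of carbon-c-relay.
# The leading "statistics" keyword is assumed: the fragment quoted in the
# issue begins at "submit every" and does not show how the block opens.
TEMPLATE = """statistics
  submit every {interval} seconds
  reset counters after interval
  prefix with monitoring.carbon-relay.{host}.relay
;
"""


def render_statistics(hostname, interval=60):
    """Render the stanza using the short hostname of the current instance."""
    short_name = hostname.split(".")[0]  # drop any domain suffix
    return TEMPLATE.format(interval=interval, host=short_name)
```

The output would be written into the relay config by whatever runs at boot (for example a Chef template), followed by a full restart of the service, since the thread reports that a HUP alone was not enough.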