go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[Feature Request] TCP stalls on cache overflow #199

Open GuillaumeConnan opened 7 years ago

GuillaumeConnan commented 7 years ago

Hi,

Currently, when the cache overflows on the TCP receiver, metrics seem to be systematically dropped.

This is quite annoying: during unexpected load peaks on a shared infrastructure, legitimate metrics can be dropped too.

It could be very useful to implement a different behavior, such as TCP stalls, that shifts the problem to the client side (by telling clients to wait before sending more data) or to a higher-level component like carbon-c-relay, which already does this when its queue is full.

For that, TCP stalls seem like a good approach: flow control is natively implemented by every TCP client, and the client-side TCP queue should never grow too large.
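
For illustration, here is a minimal Go sketch of the stall idea. This is not go-carbon's actual receiver; the bounded-channel cache and the port are assumptions for the sketch:

```go
package main

import (
	"bufio"
	"log"
	"net"
)

// Toy bounded cache: a full channel stands in for go-carbon's cache
// overflow condition. Nothing drains it here; a real server would have
// a persister goroutine reading from it.
var cache = make(chan string, 100000)

// handleConn reads newline-delimited metrics. When the cache is full,
// the channel send blocks, so we stop reading from the socket. Unread
// bytes then pile up in the kernel receive buffer, the advertised TCP
// window shrinks to zero, and the sender stalls instead of having its
// metrics silently dropped.
func handleConn(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		cache <- scanner.Text() // blocks when full => back-pressure
	}
}

func main() {
	ln, err := net.Listen("tcp", ":2003")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleConn(conn)
	}
}
```

The point is that the receiver does nothing special: simply not reading is enough for the kernel's TCP flow control to push the pressure back to the sender.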

What do you think about this?

Thanks for your help!

deniszh commented 7 years ago

IMO the relay cache is usually much smaller than the carbon cache, so I'm not sure it makes sense. Just increase the cache size on go-carbon.
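
For reference, the relevant knob lives in the `[cache]` section of go-carbon.conf; something like the following (the values here are illustrative, check the go-carbon.conf shipped with your version for exact options and defaults):

```toml
[cache]
# upper bound on points held in memory; metrics are dropped above this,
# so raise it to absorb larger peaks (value here is illustrative)
max-size = 10000000
# flush the metrics with the most points first
write-strategy = "max"
```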

GuillaumeConnan commented 7 years ago

Yeah, that sounds good too; I hadn't seen it that way. It could be a solution :)

It brings up another question for me: in most cases, a relay passes data much faster than go-carbon can write the received data to whisper files.

So I don't know whether a smaller queue on the relay can absorb traffic peaks efficiently.

Maybe I'm worrying about nothing; have you ever had this kind of problem?

deniszh commented 7 years ago

> in most cases, a relay passes data much faster than go-carbon can write the received data to whisper files.

That's generally not true; otherwise, the whole system would not be able to work, right? If one cache instance is not enough to accept the whole load, you need to add another one and join them into a cluster using graphite-web. But usually it's fine, because data is aggregated and cached before being written to disk.
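
As a rough illustration of the clustering side, this is what it looks like in graphite-web's local_settings.py (hostnames, ports, and instance names are placeholders; consult the graphite-web docs for your version):

```python
# Fan reads out across several go-carbon instances so the write load
# can be sharded between them.
CLUSTER_SERVERS = ["carbon-a.example.com:8080", "carbon-b.example.com:8080"]

# Query each instance's carbonlink port for points still held in cache
# ("host:port:instance"; values here are placeholders).
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
```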

xneo64 commented 7 years ago

@deniszh the problem @SilentHunter44 is mentioning usually occurs when a relay has been caching metrics for a carbon store that is down (e.g. for maintenance): once the carbon store starts again, the relay dumps all of its cached metrics to it as fast as possible, sometimes even overwhelming go-carbon's cache. The question is where a feature like this should be implemented: in go-carbon via TCP stalls, or as a throttle on the relay's cache output?
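
To make the second option concrete, here is a rough Go sketch of a throttled backlog drain. This is not carbon-c-relay's actual code; the rate, address, and function names are assumptions:

```go
package main

import (
	"bufio"
	"context"
	"log"
	"net"

	"golang.org/x/time/rate"
)

// drainBacklog replays queued metric lines to a (re)started store at no
// more than `limit` lines per second, so a recovering go-carbon is not
// flooded by the relay's accumulated queue all at once.
func drainBacklog(ctx context.Context, backlog []string, addr string, limit rate.Limit) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	w := bufio.NewWriter(conn)
	lim := rate.NewLimiter(limit, 1000) // allow short bursts of up to 1000 lines
	for _, line := range backlog {
		if err := lim.Wait(ctx); err != nil { // blocks until a token is available
			return err
		}
		if _, err := w.WriteString(line + "\n"); err != nil {
			return err
		}
	}
	return w.Flush()
}

func main() {
	backlog := []string{"some.metric 1 1503930000"} // placeholder data
	err := drainBacklog(context.Background(), backlog, "127.0.0.1:2003", rate.Limit(50000))
	if err != nil {
		log.Fatal(err)
	}
}
```

Either way the effect is similar: pressure propagates backwards instead of turning into drops. The TCP-stall variant just gets that behavior for free from the kernel, while the throttle needs an explicit rate limit chosen by the operator.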