influxdata / influxdb-relay

Service to replicate InfluxDB data for high availability
MIT License

Add retry logic to http requests #9

Closed nathanielc closed 8 years ago

nathanielc commented 8 years ago

This adds retry logic to the HTTP backends. Obviously it doesn't make sense to add retry logic to the UDP backend. The intent of this logic is to reduce the number of failures during short outages or periodic network issues. _This retry logic is not sufficient for long periods of downtime, as all data is buffered in RAM._

Config options

With these two config options it should be easy to reason about your fault tolerance properties. For example, if MaxRetryTime is 1m then a backend server cannot be down for more than a minute or it is guaranteed to be out of sync. BufferSize should be large enough to buffer all write operations for MaxRetryTime; empirically, you can measure RAM usage under a representative write load and size the buffer accordingly.
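For illustration only, a config carrying these two options might look like the fragment below (the section layout and key names here are assumptions for the sketch, not the relay's actual schema):

```toml
[[http]]
name = "local-ha"

[[http.output]]
name = "influxdb-a"
location = "http://127.0.0.1:8086/write"
# Longest a backend may be down before buffered writes are dropped
# and the backend is guaranteed to be out of sync.
max-retry-time = "1m"
# Must be large enough to hold all writes accumulated during
# max-retry-time; sized empirically from observed RAM usage.
buffer-size = 10000
```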

Each backend has its own buffer and retries are serialized to each backend. This should prevent stampeding of requests once a backend server recovers from an outage.
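A minimal sketch of the per-backend buffer idea: a bounded queue accepts writes until full, and a single worker would drain it in order, so a recovering backend sees one in-flight request at a time rather than a thundering herd. The types and names here are hypothetical, not the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// retryBuffer holds pending writes for one backend in a bounded queue.
// Serializing retries through one consumer per backend is what avoids
// a stampede of queued requests when the backend comes back up.
type retryBuffer struct {
	queue chan []byte
}

func newRetryBuffer(size int) *retryBuffer {
	return &retryBuffer{queue: make(chan []byte, size)}
}

// post enqueues a write without blocking; when the buffer is full it
// returns an error so the failure can be logged and reported to the
// client, matching the behavior described in the testing notes.
func (r *retryBuffer) post(batch []byte) error {
	select {
	case r.queue <- batch:
		return nil
	default:
		return errors.New("retry buffer full")
	}
}

func main() {
	r := newRetryBuffer(2)
	fmt.Println(r.post([]byte("a"))) // <nil>
	fmt.Println(r.post([]byte("b"))) // <nil>
	fmt.Println(r.post([]byte("c"))) // retry buffer full
}
```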

TODO:

NOTE: I also implemented the HTTP timeout, since it was a configuration option but did not work. (I ran into that bug during testing.)

~~NOTE: This PR adds one new dependency on https://github.com/cenkalti/backoff I thought about copy/pasting the needed bits, but that would have meant copying nearly all of the repo, so I decided importing was better than copy/paste here. It's a simple, well-written package. I could be convinced otherwise if someone else feels strongly.~~ Dep has been removed

nathanielc commented 8 years ago

@joelegasse Here is an updated version...

nathanielc commented 8 years ago

@joelegasse How does this look now?

joelegasse commented 8 years ago

Looks good to me, still has [WIP] in the title, though. Have you tested this out locally?

nathanielc commented 8 years ago

@joelegasse I have tested locally, I was able to buffer up several thousand writes when a backend was down and to see them all written once the backend came back online. The backoff worked as expected. I was also able to fill up the buffer and see that errors are correctly logged and returned to the client.