influxdata / influxdb-relay

Service to replicate InfluxDB data for high availability
MIT License

proposed clarifications #7

Closed beckettsean closed 8 years ago

beckettsean commented 8 years ago

@toddboom @joelegasse @rkuchan

joelegasse commented 8 years ago

@beckettsean We've got some proposed changes in #9 that will add some retry logic and buffering to the relay. That should allow for small amounts of downtime to be recoverable without intervention. We'll need to update the README accordingly when that change goes in.

We do need to clean up the error handling logic to check for a 400 instead of just returning the first error response from any of the servers.
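As a rough sketch of that idea, the helper below picks one client-facing status from the statuses returned by all backends. The function name `pickStatus` and the exact precedence (a 400 wins, then any success, otherwise 503) are assumptions made for illustration, not the relay's confirmed behavior.

```go
package main

import "net/http"

// pickStatus is a hypothetical helper that reduces the HTTP statuses
// returned by every backend to a single status for the client.
// Assumed policy: a 400 means the payload itself was rejected, so it
// is reported first; otherwise any 2xx counts as success; otherwise
// the write is treated as unavailable.
func pickStatus(statuses []int) int {
	succeeded := false
	for _, s := range statuses {
		switch {
		case s == http.StatusBadRequest:
			return http.StatusBadRequest
		case s >= 200 && s < 300:
			succeeded = true
		}
	}
	if succeeded {
		return http.StatusNoContent
	}
	return http.StatusServiceUnavailable
}
```

The point of the sketch is that the result depends on the kind of error rather than on which server happened to answer first.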

Regarding your inline question, though: I do not think the clients should be made aware of each backend failure. The failed writes will be logged, and we are planning on adding status and statistics endpoints to the relay to better monitor its status. This might also allow for some automated logic for routing queries from the load balancer: if a backend has some failed writes, then remove it from the query pool and fire an alert.
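One way such monitoring could be sketched is shown below. Everything here is hypothetical: the `backendStats` type, its methods, and the JSON shape are invented for illustration, and the relay's planned endpoints may look quite different.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sort"
	"sync"
)

// backendStats is a hypothetical per-backend counter of failed writes.
type backendStats struct {
	mu     sync.Mutex
	failed map[string]int
}

func newBackendStats(names ...string) *backendStats {
	s := &backendStats{failed: make(map[string]int)}
	for _, n := range names {
		s.failed[n] = 0
	}
	return s
}

// RecordFailure notes one failed write for the named backend.
func (s *backendStats) RecordFailure(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.failed[name]++
}

// QueryPool returns, in sorted order, the backends that have no
// recorded failed writes and so are assumed safe to query.
func (s *backendStats) QueryPool() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	pool := []string{}
	for name, count := range s.failed {
		if count == 0 {
			pool = append(pool, name)
		}
	}
	sort.Strings(pool)
	return pool
}

// ServeHTTP exposes the failure counts as JSON so that a load
// balancer or alerting system could poll them.
func (s *backendStats) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s.failed)
}
```

Because `backendStats` implements `http.Handler`, it could be mounted with `http.Handle` on a path of your choosing; a load balancer would then drop any backend missing from `QueryPool` and raise an alert.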

beckettsean commented 8 years ago

@joelegasse thanks for the context, sounds good to me.

joelegasse commented 8 years ago

@beckettsean Can you review the updated README, and then update this PR with any new/updated clarifications?

beckettsean commented 8 years ago

@joelegasse I made a few suggested edits, and I've got one question. In the restore process, we tell users to create a backup of the shard and then restore that backup on the affected server. Is backup/restore necessary or could a simple file copy also work?

joelegasse commented 8 years ago

@beckettsean This still conflicts with the README around the buffering section. Most of the other changes look good, though.

As for backup vs. copy: I'm pretty sure backup/restore would be preferred. Otherwise, both servers would have to be down while the data is copied, whereas, in theory, the backup can be done without taking the "good" server down. I think some testing of various failure and recovery scenarios would give us better insight, since I'm mostly speculating.

beckettsean commented 8 years ago

@joelegasse I didn't try to change the meaning of the text, just clarify it. If there's a conflict then to me it seemed to be in the original, too. Can you point me to what is conflicting?

joelegasse commented 8 years ago

It's the paragraph that begins "The relay will listen...". It looks like Git can't merge the changes cleanly. I could make a best-effort guess at resolving the conflict, but I wanted to avoid making assumptions about what you thought would be clearer.

Can you rebase your changes to the latest master commit?

beckettsean commented 8 years ago

Rebased and committed. Apologies, @joelegasse, I didn't realize you meant actual Git conflicts; I thought you meant the text conflicted with reality.