Closed: yusefnapora closed this 8 years ago
Okay, so after a bit of testing this morning, I've confirmed that this will reconnect if the RPC service goes down and comes back up within the retry period. The retry helper maxes out at 60 seconds between attempts, and I set the default to 20 retry attempts, which should give us enough time to ride out normal maintenance restarts.
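For illustration, here's a minimal sketch of what a retry helper with those semantics might look like. The names (`with_retry`, `RetryableError`) and the `initial_delay` parameter are assumptions for this sketch, not the project's actual API:

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying,
    e.g. a recoverable gRPC failure."""


def with_retry(fn, max_attempts=20, initial_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on RetryableError with exponential backoff.

    The delay between attempts doubles each time but is capped at
    max_delay seconds; after max_attempts failures the error propagates.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the defaults above (1s initial delay, 60s cap, 20 attempts), the total retry window is several minutes, which is what makes a normal maintenance restart survivable.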
One thing that worries me about this implementation: if the blockchain catchup completed before the stream was interrupted, the catchup thread won't be re-run when the stream comes back up. So we could potentially miss records if a new block was published during the downtime.
I think that can be addressed when we add the "catchup to known block" functionality, by just keeping track of the last seen block and restarting the catchup worker if need be.
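A rough sketch of that bookkeeping, assuming we can compare the last block we processed against the current chain head on reconnect (the class and method names here are hypothetical):

```python
class CatchupTracker:
    """Hypothetical bookkeeping: remember the last block seen on the
    journal stream so that, after a reconnect, we can tell whether new
    blocks were published during the downtime and the catchup worker
    needs to be restarted."""

    def __init__(self):
        self.last_seen_block = None

    def record(self, block_ref):
        """Call for every block processed from the journal stream."""
        self.last_seen_block = block_ref

    def needs_catchup(self, current_chain_head):
        """True if the chain head moved past the last block we processed,
        i.e. records were published while we were disconnected."""
        return current_chain_head != self.last_seen_block
```

On reconnect, the follower would check `needs_catchup(...)` and restart the catchup worker from `last_seen_block` if it returns true.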
Yeah, using the known block height sounds like the right fix.
This will try to restart the journal stream if we get a recoverable gRPC error while iterating over it. It changes `BlockchainFollower` to accept a function that opens the stream, instead of the stream itself. Then, in the event receiver thread, the whole "open and consume stream" process is wrapped in a helper method, which is called via the `with_retry` helper. So, if we get a recoverable error, it will try to reopen the stream and start again.

This will lead to duplicate entries on the output stream if you get disconnected partway through, since the new journal stream will start over from the beginning. We should add some bookkeeping here to track the last received block, etc., but that's its own issue that we need to tackle separately.
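The receiver loop described above could look roughly like this. It's a self-contained sketch with an inline retry loop; the names (`follow_journal`, `RecoverableStreamError`) are hypothetical stand-ins, not the project's actual identifiers:

```python
import time


class RecoverableStreamError(Exception):
    """Hypothetical stand-in for a recoverable gRPC error."""


def follow_journal(open_stream, handle_event, max_attempts=20,
                   initial_delay=1.0, max_delay=60.0):
    """Open the journal stream via open_stream() and consume events,
    re-opening the stream from scratch on a recoverable error.

    open_stream is a zero-argument function returning a fresh event
    iterator -- this is why BlockchainFollower takes a stream-opening
    function rather than a stream object: a consumed stream can't be
    re-opened after a failure.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            # A brand-new stream is opened on every attempt, so after a
            # mid-stream disconnect we replay the journal from the start;
            # downstream consumers may therefore see duplicate entries.
            for event in open_stream():
                handle_event(event)
            return  # stream ended normally
        except RecoverableStreamError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

Note how a disconnect after the first event replays that event on reconnect, which is exactly the duplicate-entries behavior mentioned above; deduplicating via a last-received-block marker would go in the `handle_event` path.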