smart-facility / cognicity-reports-powertrack

cognicity-reports: NodeJS app - Twitter & GNIP PowerTrack support for the CogniCity framework

Reconnect on RDS failover #5

Closed matthewberryman closed 9 years ago

matthewberryman commented 9 years ago

Currently, if failover of our RDS instance happens for some reason, automatic or manual, the harvester falls over:

```
2014-11-30T20:57:33.817Z - error: uncaughtException: terminating connection due to administrator command, error: terminating connection due to administrator command
    at Connection.parseE (/home/ec2-user/cognicity-reports-powertrack/node_modules/pg/lib/connection.js:561:11)
    at Connection.parseMessage (/home/ec2-user/cognicity-reports-powertrack/node_modules/pg/lib/connection.js:390:17)
    at null.<anonymous> (/home/ec2-user/cognicity-reports-powertrack/node_modules/pg/lib/connection.js:92:20)
    at Socket.emit (events.js:95:17)
    at Socket.<anonymous> (_stream_readable.js:764:14)
    at Socket.emit (events.js:92:17)
    at emitReadable (_stream_readable.js:426:10)
    at emitReadable (_stream_readable.js:422:5)
    at readableAddChunk (_stream_readable.js:165:9)
    at Socket.Readable.push (_stream_readable.js:127:10)
2014-11-30T20:57:33.817Z - error: Fatal error: Application shutting down
2014-11-30T20:57:33.817Z - info: Exiting with status 1
```

Noting the documentation on what happens during failover, including the time frames, I would suggest the following on database failure:

matthewberryman commented 9 years ago

We'd also have to bear in mind the Gnip time limits for automatic catch-up. Can you please remind me what those are?

matthewberryman commented 9 years ago

For reference, the timings on failover for manual restart (due to instance type change): (screenshot attached: failover timings, 2014-12-01 09:34:40)

benatwork99 commented 9 years ago

We get 5 minutes of 'backfill' data from Gnip when we reconnect - i.e. if we drop the Gnip connection and reconnect within 5 minutes it sends us all that data immediately on connection.

We don't currently use 'replay' data, which is where we can select any time period from the previous 5 days and Gnip will send us that data. This is slow to stream (about 4x real-time speed), so I think we'd need some special handling to work with replay data.

I was thinking about this earlier: we could have a special invocation of the harvester which would not run in the background, but would run as an instance of an application which connected and processed replay data. This could be run at the same time as the harvester, I imagine.

Also, we could have the new invocation work by processing a file. This would be useful for testing and development, and may be a quicker way to work with replay data if it's only an occasional requirement. A developer could save the replay data to a file, then invoke the harvester on the file.

Regarding the DB error: the stacktrace above is good, in a way. We're now hitting the uncaughtException handler (previously this exception wasn't handled at all), so we're getting logging. This is just a type of exception which PG throws that we're currently not dealing with. The behaviour of catching this error and trying to reconnect PG would be all new and will need testing though, so I should work on this in a branch.

benatwork99 commented 9 years ago

I think we could just cache the Gnip results in memory until we reconnect. If we reconnect, great: push all the latest hits through the filter (which would be 15 minutes' worth at most), so these should be timely enough to still be useful. If we fail to reconnect, tweet the admin and shut down.
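The in-memory caching idea above could be sketched as a small buffer that queues tweets while the DB is down and flushes them through the normal filter on reconnect. `TweetBuffer` and its method names are hypothetical, purely for illustration.

```javascript
// Hedged sketch: buffer incoming Gnip activities in memory during a DB
// outage (~15 minutes' worth at most), then replay them in order once the
// connection is back.
class TweetBuffer {
  constructor() {
    this.pending = [];
    this.dbUp = true;
  }
  receive(tweet, filterFn) {
    if (this.dbUp) {
      filterFn(tweet); // normal path: process immediately
    } else {
      this.pending.push(tweet); // DB down: cache until reconnect
    }
  }
  onDisconnect() {
    this.dbUp = false;
  }
  onReconnect(filterFn) {
    this.dbUp = true;
    // Flush cached hits in arrival order through the usual filter.
    while (this.pending.length) filterFn(this.pending.shift());
  }
}
```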

matthewberryman commented 9 years ago

Ok. Given that we can cache the Gnip results in memory, the timing (w.r.t. Gnip) matters less, so I think the above logic of trying every 3 minutes, up to 5 retries, still makes sense.
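The retry schedule described here (every 3 minutes, up to 5 attempts) could be expressed as a small generic wrapper. This is a sketch only; `reconnectWithRetry` and `connectFn` are hypothetical names, and the real harvester's pg connect call would be passed in.

```javascript
// Hedged sketch: retry a DB reconnect on a fixed interval, giving up after a
// set number of attempts (per the thread: every 3 minutes, up to 5 retries).
async function reconnectWithRetry(connectFn, { retries = 5, intervalMs = 3 * 60 * 1000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await connectFn(); // success: resume harvesting
    } catch (err) {
      if (attempt === retries) throw err; // exhausted: tweet admin, shut down
      await new Promise((resolve) => setTimeout(resolve, intervalMs)); // wait before next attempt
    }
  }
}
```

On success the buffered Gnip results would be flushed; on final failure the existing shutdown path (notify the admin, exit) would run.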