taskcluster / ec2-manager

Mozilla Public License 2.0

Instances which fail to be processed in the 20 SQS delivery attempts should be added to a table (optimization) #11

Open jhford opened 7 years ago

jhford commented 7 years ago

Right now, if a state-change notification message for which we need to run the describeInstances API fails to process in the SQS message handler, we retry it up to 20 times. When the 20th attempt fails, we currently drop the message on the floor, but report it in Sentry. These messages are infrequent, but they should be handled better.

Right now, the hourly HouseKeeper.sweep() call will take care of adding these instances to the internal state, but we should have something quicker than that.

I think the right approach is to have a table instancespending with two columns: id (varchar(128)) and received (timestamptz). The dead letter exchange listener for the EC2 Event queue should take these messages and insert the instance ID into this table. A poller should then poll this table and check whether describeInstances returns values for each ID yet. If the instance is found, we should upsert it into the database and remove it from instancespending. If it is not found, we should keep it in the table for at least twice as long as the HouseKeeper.sweep() delay.
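
To make the proposal concrete, here is a minimal sketch of the poller logic in Python. It is not ec2-manager code: the in-memory PendingInstances class stands in for the proposed instancespending table, and the describe and upsert callables, the poll_pending name, and the choice to drop a row once the retention window has passed are all assumptions made for illustration. The issue only says rows should be kept for at least twice the sweep delay; what happens afterwards is left open.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

# The issue describes HouseKeeper.sweep() as hourly; rows are kept for
# at least twice that long.
SWEEP_INTERVAL = timedelta(hours=1)
RETENTION = 2 * SWEEP_INTERVAL


class PendingInstances:
    """In-memory stand-in for the proposed instancespending table (id, received)."""

    def __init__(self) -> None:
        self.rows: Dict[str, datetime] = {}

    def insert(self, instance_id: str, received: datetime) -> None:
        # Keep the earliest receipt time if the same id is dead-lettered twice.
        self.rows.setdefault(instance_id, received)


def poll_pending(
    pending: PendingInstances,
    describe: Callable[[List[str]], Dict[str, dict]],  # hypothetical describeInstances wrapper
    upsert: Callable[[str, dict], None],               # hypothetical database upsert
    now: datetime,
) -> None:
    """Run one polling pass over the pending table."""
    ids = list(pending.rows)
    if not ids:
        return
    # Assumed to return a mapping of id -> description for the ids EC2 knows about.
    found = describe(ids)
    for instance_id in ids:
        if instance_id in found:
            # The instance is visible now: record it and stop tracking it.
            upsert(instance_id, found[instance_id])
            del pending.rows[instance_id]
        elif now - pending.rows[instance_id] >= RETENTION:
            # Assumption: after the retention window, the hourly sweep has had
            # two chances to pick the instance up, so the row is dropped.
            del pending.rows[instance_id]
```

In this sketch the dead letter listener would call pending.insert(id, received) for each failed message, and poll_pending would run on a timer considerably shorter than the sweep interval.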