openedx-unsupported / edx-analytics-configuration

GNU Affero General Public License v3.0

Handle transient failures in provisioning master instance. #18

Closed brianhw closed 8 years ago

brianhw commented 8 years ago

EMR may fail to provision a master instance on the first try. So calling list-instances with an instance-group-type of 'MASTER' may indeed return more than one instance. The failed instances have a State of "TERMINATED" with code "INSTANCE_FAILURE". Rather than checking all of these, we just assume instead that the "good" master is the one at the end of the list.
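As a rough illustration of the approach described above (not the code from this PR), here is a minimal sketch that takes an already-parsed `list-instances` response and returns the last entry. The sample data, the instance IDs, and the helper name `pick_master` are hypothetical. The `Status.State` and `Status.StateChangeReason.Code` fields are assumed to follow the shape EMR reports for instances.

```python
# Hypothetical, trimmed-down list-instances response for the MASTER group.
# The first entry models a failed provisioning attempt, the second the retry.
sample_response = {
    "Instances": [
        {
            "Id": "ci-FAILED",  # made-up ID for illustration
            "Status": {
                "State": "TERMINATED",
                "StateChangeReason": {"Code": "INSTANCE_FAILURE"},
            },
        },
        {
            "Id": "ci-GOOD",  # made-up ID for illustration
            "Status": {"State": "RUNNING", "StateChangeReason": {}},
        },
    ]
}


def pick_master(response):
    """Return the last instance in the response, assumed to be the good master."""
    instances = response.get("Instances", [])
    if not instances:
        raise ValueError("list-instances returned no MASTER instances")
    # No state inspection: the last entry is taken to be the good master.
    return instances[-1]


master = pick_master(sample_response)
print(master["Id"])
```

The sketch deliberately ignores `State`; it relies only on the assumption that the good master is at the end of the list.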

@mulby

brianhw commented 8 years ago

By the way, I ran this successfully on the instance that had originally failed as reported in AN-6641.

mulby commented 8 years ago

:+1:

Optional: would it be trivial to find the first instance in the list that is in a valid state? If so, I would slightly prefer the explicitness of such an approach.
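For reference, a sketch of what that explicit alternative might look like, assuming "valid" means any state other than `TERMINATED`. The thread does not define "valid", so that reading is an assumption. The helper name `first_valid_master` and the sample data are hypothetical, and this is not code from the PR.

```python
# Assumption: an instance counts as "valid" unless it is in this state.
INVALID_STATES = {"TERMINATED"}


def first_valid_master(instances):
    """Return the first instance whose state is not invalid, or None."""
    for instance in instances:
        state = instance.get("Status", {}).get("State")
        if state not in INVALID_STATES:
            return instance
    return None


# Hypothetical data: one failed provisioning attempt, then a good master.
sample_instances = [
    {"Id": "ci-FAILED", "Status": {"State": "TERMINATED"}},
    {"Id": "ci-GOOD", "Status": {"State": "RUNNING"}},
]

chosen = first_valid_master(sample_instances)
print(chosen["Id"] if chosen else "no valid master found")
```

Unlike taking the last element, this version has to decide what to do when no instance qualifies. Here it returns `None` and leaves that decision to the caller.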

brianhw commented 8 years ago

Well, this doesn't happen very often, which makes reproducing it for testing a little difficult. Also, one tends to find things in the last place they look, so that's the answer we want to return here. I wouldn't know what to do in a case where my logic for finding an instance in a valid state didn't actually find the last instance. I would prefer to just take the last one and not try to get the states right.

mulby commented 8 years ago

seems reasonable: merge when ready