openedx-unsupported / edx-analytics-configuration

GNU Affero General Public License v3.0
8 stars 28 forks source link

Fix issue when some clusters do not have names #6

Closed mulby closed 9 years ago

mulby commented 9 years ago

This broke all jobs last night. Apparently there is a cluster that doesn't have a name.

clintonb commented 9 years ago

This treats the symptom; but, why doesn't the cluster have a name?

brianhw commented 9 years ago

LGTM. :+1:

mulby commented 9 years ago

@clintonb agreed - on thanksgiving morning at 6 am I'm definitely in "get it working mode", we should figure out what the root cause of this is.

I'll add some logging, hopefully that will help with the diagnosis.

clintonb commented 9 years ago

Oh yeah. Good call. Thanks!

Clinton Blackburn

mulby commented 9 years ago

So it looks like there was a AMI version 3.3.1 cluster that was created somehow without a name nor a job flow ID. Amazon really should prevent this sort of thing from happening since that's a pretty pathalogical case... but they don't apparently so we have to defend against it in our logic.

While I was in here I sped this up pretty significantly since it now only cycles through clusters that are not terminated, instead of all clusters we have ever created (which used to be small number, now isn't so small). I also pushed it over to use the master private IP address instead of the public DNS name since that's what we have to use locally. Finally, I added some debug output that shows us the job flow name and private IP of the master node to aide with debugging and generally make it easier to work with.

Note that I cannot add logging here since Ansible merges the stdout and stderr streams and requires that the merged result is valid JSON. I catch all exceptions in the "main" function and return valid JSON containing a description of the error. I did update the code to provide a more detailed error message (with stack trace etc). It is of limited utility, however, since ansible actually rewrites the python code, messing up line numbers etc, but at least it tells you what functions were called...

@brianhw @clintonb

mulby commented 9 years ago

also @rocha

mulby commented 9 years ago

and @benpatterson

brianhw commented 9 years ago

LGTM. :+1: And I'm likely the one who created the one with AMI 3.3.1 during the hackathon.

rocha commented 9 years ago

@mulby I may have been the one that spun the 3.3.1 cluster (I was checking Hue). However, I do remember it had an ID since I used the EMR console tools to access it. I remember shutting it down, maybe that is what caused the problem?

rocha commented 9 years ago

Apart from that :rocket:

Maybe we should use a different account when testings EMR features.

mulby commented 9 years ago

@rocha it's very odd, since I don't see it anywhere in the console.

benpatterson commented 9 years ago

Obviously it prevents errors, but I'm curious if there's any improvement in provisioning time/accuracy after the ALIVE_STATES list is put into use.

benpatterson commented 9 years ago

(and thanks for the improvements along with the fix. :) )