Closed mulby closed 9 years ago
This treats the symptom; but, why doesn't the cluster have a name?
LGTM. :+1:
@clintonb agreed - on thanksgiving morning at 6 am I'm definitely in "get it working mode", we should figure out what the root cause of this is.
I'll add some logging, hopefully that will help with the diagnosis.
Oh yeah. Good call. Thanks!
Clinton Blackburn
So it looks like there was a AMI version 3.3.1 cluster that was created somehow without a name nor a job flow ID. Amazon really should prevent this sort of thing from happening since that's a pretty pathalogical case... but they don't apparently so we have to defend against it in our logic.
While I was in here I sped this up pretty significantly since it now only cycles through clusters that are not terminated, instead of all clusters we have ever created (which used to be small number, now isn't so small). I also pushed it over to use the master private IP address instead of the public DNS name since that's what we have to use locally. Finally, I added some debug output that shows us the job flow name and private IP of the master node to aide with debugging and generally make it easier to work with.
Note that I cannot add logging here since Ansible merges the stdout and stderr streams and requires that the merged result is valid JSON. I catch all exceptions in the "main" function and return valid JSON containing a description of the error. I did update the code to provide a more detailed error message (with stack trace etc). It is of limited utility, however, since ansible actually rewrites the python code, messing up line numbers etc, but at least it tells you what functions were called...
@brianhw @clintonb
also @rocha
and @benpatterson
LGTM. :+1: And I'm likely the one who created the one with AMI 3.3.1 during the hackathon.
@mulby I may have been the one that spun the 3.3.1 cluster (I was checking Hue). However, I do remember it had an ID since I used the EMR console tools to access it. I remember shutting it down, maybe that is what caused the problem?
Apart from that :rocket:
Maybe we should use a different account when testings EMR features.
@rocha it's very odd, since I don't see it anywhere in the console.
Obviously it prevents errors, but I'm curious if there's any improvement in provisioning time/accuracy after the ALIVE_STATES list is put into use.
(and thanks for the improvements along with the fix. :) )
This broke all jobs last night. Apparently there is a cluster that doesn't have a name.