nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock Describe errors if one of the clusters is shutting down #236

Closed steve-drew-strong-bridge closed 6 years ago

steve-drew-strong-bridge commented 6 years ago

When trying to list the clusters that are still running, I receive an untrapped error when one of the clusters is in the process of shutting down. Running this same command a minute later will work, as

Expected: continue to describe the instances and maybe give a brief eulogy for the dead cluster.

flintrock describe --ec2-region=us-west-2

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 1132, in main
    cli(obj={})
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 510, in describe
    vpc_id=ec2_vpc_id)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1009, in get_clusters
    _get_cluster_name(instance) for instance in all_clusters_instances}
  File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1009, in <setcomp>
    _get_cluster_name(instance) for instance in all_clusters_instances}
  File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1063, in _get_cluster_name
    i=instance.id))
Exception: Could not extract cluster name from instance: i-0f4f75fb5f429994f

nchammas commented 6 years ago

Yeah, I've hit this myself. This would be a good issue for a new contributor to investigate and submit a patch for, since I suspect the fix is not too involved. I'd be happy to help.

Taking a quick look, I think the fix here is to update this snippet to ignore instances that are shutting-down or terminated (Boto3 ref, instance lifecycle ref).

dm-tran commented 6 years ago

flintrock describe also raises an error while a cluster is being launched or while slaves are being added:

Traceback (most recent call last):
  File "standalone.py", line 11, in <module>
  File "flintrock/flintrock.py", line 1110, in main
  File "click/core.py", line 716, in __call__
  File "click/core.py", line 696, in main
  File "click/core.py", line 1060, in invoke
  File "click/core.py", line 889, in invoke
  File "click/core.py", line 534, in invoke
  File "click/decorators.py", line 17, in new_func
  File "flintrock/flintrock.py", line 477, in describe
  File "flintrock/ec2.py", line 1053, in get_clusters
  File "flintrock/ec2.py", line 1053, in <listcomp>
  File "flintrock/ec2.py", line 1104, in _compose_cluster 
  File "flintrock/ec2.py", line 1080, in _get_cluster_master_slaves
TypeError: 'NoneType' object is not iterable

It seems that boto filters cannot exclude values. Instead, we have to list the values that are allowed. I think we're only interested in "pending" and "running" instances.

I tried adding the following filter to this snippet:

{'Name': 'instance-state-name', 'Values': ['pending', 'running']},

steve-drew-strong-bridge commented 6 years ago

@dm-tran you may want to add 'inconsistent' to the filter. While Flintrock reports 0 nodes for inconsistent clusters, there may actually be nodes still running. You can repro this by starting a flintrock launch clustername (using 2 or 3 nodes) and then pressing Ctrl-C right after it requests the instances. (Which, I admit, I have done because I realized I used the wrong config.) :-D

In that scenario the security group is created and the instances started, but flintrock reports an inconsistent state and 0 nodes. Only by tracking down the security group and finding the associated instances can those be removed.

All of which is to say, if we filter out 'inconsistent' we may have nodes out there costing us money that we're unaware of.
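As a rough sketch of how such orphaned instances could be tracked down through the security group, the helper below asks EC2 for every non-terminated instance attached to a given group. It assumes `ec2` is a Boto3 EC2 service resource; the function name and the group name in the usage comment are hypothetical and are not taken from Flintrock's code.

```python
def instances_in_security_group(ec2, group_name):
    """Return the IDs of non-terminated instances attached to a security group.

    `ec2` is assumed to be a Boto3 EC2 service resource, for example
    boto3.resource('ec2', region_name='us-west-2').
    """
    filters = [
        # Match instances by the name of a security group they belong to.
        {'Name': 'instance.group-name', 'Values': [group_name]},
        # Skip instances that are already going away or gone.
        {'Name': 'instance-state-name',
         'Values': ['pending', 'running', 'stopping', 'stopped']},
    ]
    return [instance.id for instance in ec2.instances.filter(Filters=filters)]


# Hypothetical usage (the group name is a placeholder):
#   import boto3
#   ec2 = boto3.resource('ec2', region_name='us-west-2')
#   print(instances_in_security_group(ec2, 'my-cluster-security-group'))
```

Any IDs this returns for a cluster that flintrock describe shows with 0 nodes would be the instances still costing money.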

dm-tran commented 6 years ago

@steve-drew-strong-bridge actually, "inconsistent" is not a possible value of "instance-state-name". You can read the instance lifecycle ref to see the possible states.

The state displayed by flintrock describe comes from the state method in ec2.py (https://github.com/nchammas/flintrock/blob/master/flintrock/ec2.py#L113):

nchammas commented 6 years ago

@dm-tran

I think we're only interested in "pending" and "running" instances.

I think we are also interested in instances that are stopping or stopped, since flintrock describe should work with stopped clusters.

So my original recommendation to include all states except shutting-down or terminated stands, but as you pointed out we have to implement that by explicitly enumerating all the desired states since Boto3 does not support exclude filters.
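A minimal sketch of that approach, assuming `ec2` is a Boto3 EC2 service resource. The helper names are hypothetical and this is not the code that ended up in Flintrock. Because the filter can only list allowed values, the exclusion is computed up front from the six documented instance states.

```python
# The six instance states from the EC2 instance lifecycle documentation.
ALL_INSTANCE_STATES = [
    'pending', 'running', 'shutting-down', 'terminated', 'stopping', 'stopped',
]

# States we want to ignore when describing clusters.
EXCLUDED_STATES = {'shutting-down', 'terminated'}


def live_state_filter(excluded=EXCLUDED_STATES):
    """Build an 'include' filter that is equivalent to excluding `excluded`."""
    allowed = [s for s in ALL_INSTANCE_STATES if s not in excluded]
    return {'Name': 'instance-state-name', 'Values': allowed}


def find_cluster_instances(ec2, extra_filters=()):
    """List instances matching `extra_filters` that are not being destroyed.

    `ec2` is assumed to be a Boto3 EC2 service resource, for example
    boto3.resource('ec2', region_name='us-west-2').
    """
    filters = list(extra_filters) + [live_state_filter()]
    return list(ec2.instances.filter(Filters=filters))
```

With the defaults, `live_state_filter()` allows pending, running, stopping, and stopped, so stopped clusters still show up in flintrock describe.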

dm-tran commented 6 years ago

@nchammas We should indeed list stopping and stopped instances, since Flintrock supports start and stop. I will do some tests and submit a PR.

dm-tran commented 6 years ago

I opened PR https://github.com/nchammas/flintrock/pull/246

nchammas commented 6 years ago

Fixed by #246. Sounds like there is still a minor case where flintrock describe may yield an error. From @dm-tran's post on #246:

This PR partially fixes Exception: Could not extract cluster name from instance: when a cluster is destroyed or when slaves are removed (#236): this exception will be raised during the first seconds instead of being raised during 30 seconds / a minute.

But we can split that to a separate issue if necessary.