Closed: steve-drew-strong-bridge closed this issue 6 years ago.
Yeah, I've hit this myself. This would be a good issue for a new contributor to investigate and submit a patch for, since I suspect the fix is not too involved. I'd be happy to help.
Taking a quick look, I think the fix here is to update this snippet to ignore instances that are `shutting-down` or `terminated` (Boto3 ref, instance lifecycle ref).
`flintrock describe` also raises an error when a cluster is launched or when slaves are added:
Traceback (most recent call last):
File "standalone.py", line 11, in <module>
File "flintrock/flintrock.py", line 1110, in main
File "click/core.py", line 716, in __call__
File "click/core.py", line 696, in main
File "click/core.py", line 1060, in invoke
File "click/core.py", line 889, in invoke
File "click/core.py", line 534, in invoke
File "click/decorators.py", line 17, in new_func
File "flintrock/flintrock.py", line 477, in describe
File "flintrock/ec2.py", line 1053, in get_clusters
File "flintrock/ec2.py", line 1053, in <listcomp>
File "flintrock/ec2.py", line 1104, in _compose_cluster
File "flintrock/ec2.py", line 1080, in _get_cluster_master_slaves
TypeError: 'NoneType' object is not iterable
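The `TypeError` at the bottom of that traceback suggests some helper returned `None` where an iterable was expected, e.g. when no master instance can be found for a cluster that is mid-launch or mid-termination. A minimal, hypothetical reproduction of that failure mode (none of these names are from Flintrock):

```python
def get_master_and_slaves(instances):
    """Hypothetical lookup: returns (master, slaves), or None when the
    cluster has no master, e.g. mid-launch or mid-termination."""
    masters = [i for i in instances if i.get('role') == 'master']
    if not masters:
        return None
    return masters[0], [i for i in instances if i.get('role') == 'slave']

result = get_master_and_slaves([])  # cluster with no usable instances
try:
    for node in result:  # iterating over None
        pass
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable
```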
It seems that boto filters cannot exclude values. Instead, we have to specify the values that are allowed: I think we're only interested in "pending" and "running" instances.
I tried adding the following filter to this snippet:
{'Name': 'instance-state-name', 'Values': ['pending', 'running']},
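As a standalone illustration of where such an allow-list filter plugs into a plain Boto3 call (a sketch, not the actual Flintrock code; the region default and function name are assumptions):

```python
# The allow-list from above: Boto3 filters can only include values,
# so we enumerate the states we want instead of excluding
# 'shutting-down' and 'terminated'.
ALIVE_STATES = ['pending', 'running']

def alive_instances(region='us-east-1'):
    """Return pending/running EC2 instances in the given region."""
    # Imported here so the filter list above is usable without boto3.
    import boto3
    ec2 = boto3.resource('ec2', region_name=region)
    return ec2.instances.filter(
        Filters=[{'Name': 'instance-state-name', 'Values': ALIVE_STATES}]
    )
```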
With this filter:
`Exception: Could not extract cluster name from instance`: when a cluster is destroyed or when slaves are removed, this exception will be raised only during the first few seconds instead of for 30 seconds to a minute.
`TypeError: 'NoneType' object is not iterable`: instead of raising this error when a cluster is launched or when slaves are added, `flintrock describe` will display:
my-cluster-name:
state: inconsistent
node-count: 0
@dm-tran you may want to add 'inconsistent' to the filter. While Flintrock reports 0 nodes for inconsistent clusters, there may actually be nodes still running. You can repro this by starting a `flintrock launch clustername` (using 2 or 3 nodes) and then hitting Ctrl-C right after it requests the instances. (Which, I admit, I have done because I realized I used the wrong config.) :-D
In that scenario the security group is created and the instances are started, but Flintrock reports an inconsistent state and 0 nodes. Only by tracking down the security group and finding the associated instances can those be removed.
All of which is to say: if we filter out 'inconsistent', we may have nodes out there costing us money that we're unaware of.
@steve-drew-strong-bridge actually, "inconsistent" is not a possible value of "instance-state-name". You can read the instance lifecycle ref to see the possible states.
The state displayed by `flintrock describe` comes from the `state` method of ec2.py (https://github.com/nchammas/flintrock/blob/master/flintrock/ec2.py#L113):
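The method body isn't quoted here, but the gist (a cluster whose instances don't all share one state gets reported as 'inconsistent') can be sketched as follows. This is a hypothetical simplification, not the real Flintrock code:

```python
def cluster_state(instance_states):
    """Hypothetical simplification: a cluster takes the single state
    shared by all of its instances; any mix of states is 'inconsistent'."""
    states = set(instance_states)
    if len(states) == 1:
        return states.pop()
    return 'inconsistent'

print(cluster_state(['running', 'running']))  # running
print(cluster_state(['running', 'stopped']))  # inconsistent
```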
@dm-tran
I think we're only interested in "pending" and "running" instances.
I think we are also interested in instances that are `stopping` or `stopped`, since `flintrock describe` should work with stopped clusters.
So my original recommendation to include all states except `shutting-down` or `terminated` stands, but as you pointed out, we have to implement that by explicitly enumerating all the desired states, since Boto3 does not support exclude filters.
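Enumerated explicitly, that allow-list would look something like this (a sketch of the filter value, assuming the standard EC2 lifecycle state names):

```python
# Every EC2 lifecycle state except 'shutting-down' and 'terminated',
# spelled out because Boto3 filters cannot express exclusion.
DESIRED_STATES = ['pending', 'running', 'stopping', 'stopped']

state_filter = {'Name': 'instance-state-name', 'Values': DESIRED_STATES}
print(state_filter)
```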
@nchammas We should indeed list `stopping` and `stopped` instances, since Flintrock supports `start` and `stop`. I will do some tests and submit a PR.
I opened PR https://github.com/nchammas/flintrock/pull/246
Fixed by #246. Sounds like there is still a minor case where `flintrock describe` may yield an error. From @dm-tran's post on #246:
This PR partially fixes `Exception: Could not extract cluster name from instance` when a cluster is destroyed or when slaves are removed (#236): the exception will be raised only during the first few seconds instead of for 30 seconds to a minute.
But we can split that to a separate issue if necessary.
When trying to list the clusters that are still running, I receive an untrapped error when one of the clusters is in the process of shutting down. Running this same command a minute later works as expected.
Expected: continue to describe the instances, and maybe give a brief eulogy for the dead cluster.
flintrock describe --ec2-region=us-west-2
Traceback (most recent call last):
File "/usr/local/bin/flintrock", line 11, in <module>
sys.exit(main())
File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 1132, in main
cli(obj={})
File "/usr/local/lib/python3.4/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.4/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.4/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.4/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.4/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.4/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 510, in describe
vpc_id=ec2_vpc_id)
File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1009, in get_clusters
_get_cluster_name(instance) for instance in all_clusters_instances}
File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1009, in <setcomp>
_get_cluster_name(instance) for instance in all_clusters_instances}
File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 1063, in _get_cluster_name
i=instance.id))
Exception: Could not extract cluster name from instance: i-0f4f75fb5f429994f