douglaz closed this 7 years ago
To solve the inconsistent cluster (instances without any tags), I would change the code to check only the security group in this case and consider all instances as slaves. (We have similar code in our system, and this is what we do in a situation like this.)
Though I do have TODO comments scattered throughout the codebase, I usually leave those as I'm making other, more substantial changes. For an issue like this, I think we should have a separate issue, perhaps titled "Gracefully handle interrupted launches where cluster_instances is not defined". That will give us something we can easily reference, and it gives the issue more visibility than a code comment.
> To solve the inconsistent cluster (instances without any tags), I would change the code to check only the security group in this case and consider all instances as slaves. (We have similar code in our system, and this is what we do in a situation like this.)
We can try to automatically piece together a cluster even when things look broken, but I'd rather not do that and risk doing something wrong, like deleting instances that Flintrock doesn't own. That would be terrible.
Instead, I think the focus here should be on always cleaning up properly whenever a launch goes wrong. If we don't succeed in doing that, I think it's better to force the user to clean up manually than to risk any chance of Flintrock doing the wrong thing.
I believe Flintrock correctly cleans up in all cases except the one where EC2 hasn't even had a chance to return from `create_instances()`. I asked about how to handle this case on the boto3 project, and it looks like we need to use the `ClientToken` parameter to track instances across EC2 calls, completely independent of any Flintrock tags or security groups, and even if `cluster_instances` isn't defined. That way, even if `create_instances()` is interrupted, we can find the instances that were left behind by using the `ClientToken`.
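A minimal sketch of the `ClientToken` idea (the helper names here are mine, not Flintrock's; `ec2_client` is whatever `boto3.client('ec2')` returns, and `client-token` is a real EC2 `DescribeInstances` filter):

```python
import uuid

def make_client_token() -> str:
    """Generate an idempotency token to pass as ClientToken to
    run_instances(). EC2 accepts up to 64 ASCII characters."""
    return str(uuid.uuid4())

def find_instances_by_token(ec2_client, token: str) -> list:
    """Find the instances launched with a given ClientToken, even if the
    original launch call was interrupted before it could return."""
    response = ec2_client.describe_instances(
        Filters=[{'Name': 'client-token', 'Values': [token]}]
    )
    return [
        instance
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
```

If the token is generated and logged before calling `create_instances()`, a cleanup path can recover the orphaned instances with `find_instances_by_token()` even when `cluster_instances` was never assigned.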
What do you think?
The `ClientToken` approach is interesting, but I don't think it helps in my use case.
My use case is the following: I have a "supervisor" script that, when it starts, tries to find an existing cluster in good shape. If it doesn't find one, it preemptively destroys all machines that may have been created in a previous execution and then starts provisioning a new cluster from scratch (and later proceeds to run the configured jobs). Today it destroys them by terminating any machines in the cluster-name security group. (I don't put non-cluster machines in that security group, so I can safely do that.)
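That destroy step could look roughly like the following (a sketch of the described supervisor behavior, not its actual code; the helper name is hypothetical, and `instance.group-name` is a real EC2 `DescribeInstances` filter):

```python
def destroy_cluster_by_group(ec2_client, cluster_group_name: str) -> list:
    """Terminate every instance in the cluster's security group.
    This is only safe when nothing else shares that group, as in
    the supervisor setup described above."""
    response = ec2_client.describe_instances(
        Filters=[
            {'Name': 'instance.group-name', 'Values': [cluster_group_name]},
            # Skip instances that are already (being) terminated.
            {'Name': 'instance-state-name',
             'Values': ['pending', 'running', 'stopping', 'stopped']},
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if instance_ids:
        ec2_client.terminate_instances(InstanceIds=instance_ids)
    return instance_ids
```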
This script is itself supervised, and if anything happens it will be restarted until it finishes what needs to be done. The problem with the `ClientToken` approach is that someone or something may hard-interrupt this script while it's launching a cluster (e.g. `kill -9`, Linux's OOM killer, a machine crash, etc.). When the script comes back, it won't be able to kill the machines left by the previous execution unless we have some deterministic/safe way to derive the `ClientToken` used in that execution. (The script itself is stateless; it doesn't know what it did previously.)
Since AWS says we shouldn't reuse a client token, and saving/retrieving a randomly generated token is tricky/complex, I can't see how we could gracefully handle such a situation.
> The problem with the ClientToken approach is that someone or something may hard-interrupt this script while it's launching a cluster (e.g. `kill -9`, Linux's OOM killer, a machine crash, etc.).
How often does this happen for you? The situation I was targeting was when the user kills Flintrock with a simple `KeyboardInterrupt` before `create_instances()` could return. In that case, we would be able to clean up before exiting.
If we want to handle hard exits, then I agree the `ClientToken` won't help much. In that case, hmm... perhaps your approach is what we have to do. I'm a little afraid of it though. 😓
How about we display an additional warning from inside `_get_cluster_master_slaves()` when Flintrock comes across a broken cluster, and then return all the instances as slaves, as you are suggesting? Would that make cleaning up after an interrupted launch a bit better?
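A rough sketch of that fallback (the `flintrock-role` tag name is a placeholder; Flintrock's actual tagging scheme may differ):

```python
import warnings

def get_cluster_master_slaves(instances):
    """Split instances into (master, slaves) by a role tag. If no master
    is found -- e.g. after an interrupted launch left instances untagged --
    warn and treat every instance as a slave, so that describe/destroy
    can still operate on the broken cluster."""
    master = None
    slaves = []
    for instance in instances:
        tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
        if tags.get('flintrock-role') == 'master':
            master = instance
        else:
            slaves.append(instance)
    if master is None:
        warnings.warn(
            "Cluster looks broken (no master tag found); "
            "treating all instances as slaves.")
        return None, list(instances)
    return master, slaves
```

The point of the warning is visibility: the user learns the cluster is inconsistent, but `destroy` can still terminate everything it finds.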
I think the `ClientToken` approach is good for the soft kill. For the hard kill, I see no harm in warning the user, although I will always call it with `--assume-yes`.
> How often does this happen for you?
Once is enough. It's no fun to manually kill dozens of machines.
OK, understood and accepted.
Regarding my comment:
> How about we display an additional warning from inside `_get_cluster_master_slaves()` when Flintrock comes across a broken cluster, and then return all the instances as slaves, as you are suggesting?
Would that solve both the soft and hard kill problems for you? As long as `flintrock describe` and `flintrock destroy` work for these kinds of borked clusters, the user can clean up without manually killing instances.
If this works for you, do you want to implement it in this PR? No worries if not. I want to settle on the approach at the very least.
@nchammas, I agree with that approach. I can't promise to tackle it right now. Perhaps we could use a different PR for that?
While testing some code, I canceled the spot request. An exception was raised, but the try-except itself then raised an error (`UnboundLocalError: local variable 'cluster_instances' referenced before assignment`):
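The bug pattern can be reproduced minimally like this (a sketch, not Flintrock's actual code; one possible fix is simply to bind the variable before the `try`):

```python
def launch_sketch():
    """Minimal reproduction: the except block references
    cluster_instances, but the exception fires before that
    variable is ever assigned."""
    try:
        raise RuntimeError("spot request canceled")  # stand-in for the EC2 failure
        cluster_instances = []                       # never reached
    except Exception:
        # Bug: cluster_instances may not exist yet.
        for instance in cluster_instances:           # raises UnboundLocalError
            instance.terminate()

def launch_fixed():
    """Binding the name before the try block makes the cleanup
    path safe even when the failure happens early."""
    cluster_instances = []
    try:
        raise RuntimeError("spot request canceled")
    except Exception:
        for instance in cluster_instances:           # safe: empty list
            instance.terminate()
```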
Then I tried killing the cluster but caught:
Note: if you prefer to create one (or two) issues for the problems reported, feel free. I used FIXMEs because that's what I would normally do in a situation like this.