nchammas / flintrock

A command-line tool for launching Apache Spark clusters.

Avoid exception if instance has no tags #179

Closed douglaz closed 7 years ago

douglaz commented 7 years ago

While testing some code, I canceled the spot request. An exception was raised, but the exception handler itself then raised another error (UnboundLocalError: local variable 'cluster_instances' referenced before assignment):

2017-01-24 19:20:24,421 0 of 101 instances granted. Waiting...
^CTraceback (most recent call last):
  File "/home/allan/flintrock/flintrock/ec2.py", line 730, in _create_instances
    time.sleep(30)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/allan/flintrock/flintrock/ec2.py", line 905, in launch
    user_data=user_data)
  File "/home/allan/flintrock/flintrock/ec2.py", line 788, in _create_instances
    logger.info("Canceling spot instance requests...", file=sys.stderr)
  File "/usr/lib/python3.4/logging/__init__.py", line 1274, in info
    self._log(INFO, msg, args, **kwargs)
TypeError: _log() got an unexpected keyword argument 'file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/allan/flintrock/standalone.py", line 11, in <module>
    sys.exit(main())   
  File "/home/allan/flintrock/flintrock/flintrock.py", line 1087, in main
    cli(obj={})
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/allan/flintrock/flintrock/flintrock.py", line 371, in launch
    user_data=ec2_user_data)
  File "/home/allan/flintrock/flintrock/ec2.py", line 54, in wrapper
    res = func(*args, **kwargs)
  File "/home/allan/flintrock/flintrock/ec2.py", line 948, in launch
    cleanup_instances = cluster_instances
UnboundLocalError: local variable 'cluster_instances' referenced before assignment

Then I tried destroying the cluster, but got:

Traceback (most recent call last):                                                                                                                                           
  File "/home/allan/flintrock/standalone.py", line 11, in <module>                                                                                                           
    sys.exit(main())   
  File "/home/allan/flintrock/flintrock/flintrock.py", line 1087, in main
    cli(obj={})
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/allan/flintrock/flintrock/flintrock.py", line 422, in destroy
    vpc_id=ec2_vpc_id) 
  File "/home/allan/flintrock/flintrock/ec2.py", line 964, in get_cluster
    vpc_id=vpc_id)
  File "/home/allan/flintrock/flintrock/ec2.py", line 1009, in get_clusters
    for cluster_name in found_cluster_names]
  File "/home/allan/flintrock/flintrock/ec2.py", line 1009, in <listcomp>
    for cluster_name in found_cluster_names]
  File "/home/allan/flintrock/flintrock/ec2.py", line 1060, in _compose_cluster
    (master_instance, slave_instances) = _get_cluster_master_slaves(instances)
  File "/home/allan/flintrock/flintrock/ec2.py", line 1036, in _get_cluster_master_slaves
    for tag in instance.tags:
TypeError: 'NoneType' object is not iterable

Note: if you prefer that I create one (or two) issues for the problems reported here, feel free to say so. I used FIXMEs because that's what I would normally do in a situation like this.
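
For reference, the first traceback comes down to two separate problems in the launch cleanup path: logger.info() is being passed a file= keyword it does not accept, and cluster_instances is referenced in the cleanup code before it has ever been assigned. A minimal sketch of the kind of fix that implies (the names and surrounding structure here are stand-ins, not Flintrock's actual code):

```python
import logging
import sys

logger = logging.getLogger(__name__)
# logging.Logger.info() has no file= parameter (that belongs to print()),
# so route output to stderr via the handler configuration instead.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)


def create_instances_somehow():
    """Stand-in for the real EC2 call; it may be interrupted at any point."""
    raise KeyboardInterrupt


cluster_instances = []  # initialize up front so cleanup can always reference it
try:
    cluster_instances = create_instances_somehow()
except (Exception, KeyboardInterrupt):
    logger.info("Canceling spot instance requests...")
    cleanup_instances = cluster_instances  # always defined now, even if empty
    logger.info("Terminating %d leftover instances...", len(cleanup_instances))
```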

douglaz commented 7 years ago

To solve the inconsistent cluster (instances without any tags), I would change the code to check only the security group in this case and consider all instances as slaves. (We have similar code in our system, and this is what we do in such situations.)

nchammas commented 7 years ago

Though I do have TODO comments scattered throughout the codebase, I usually leave those as I'm making other, more substantial changes. For an issue like this, I think we should have a separate issue, perhaps titled "Gracefully handle interrupted launches where cluster_instances is not defined". That will give us something we can easily reference, and it gives the issue more visibility than a code comment.

> To solve the inconsistent cluster (instances without any tags), I would change the code to check only the security group in this case and consider all instances as slaves. (We have similar code in our system, and this is what we do in such situations.)

We can try to automatically piece together a cluster even when things look broken, but I'd rather not do that and risk doing something wrong, like deleting instances that Flintrock doesn't own. That would be terrible.

Instead, I think the focus here should be on always cleaning up properly whenever a launch goes wrong. If we don't succeed in doing that, I think it's better to force the user to clean up manually, to avoid any chance of Flintrock doing the wrong thing.

I believe Flintrock correctly cleans up in all cases except the one where EC2 hasn't even had a chance to return from create_instances(). I asked about how to handle this case on the boto3 project and it looks like we need to use the ClientToken parameter to track instances across EC2 calls, completely independent of any Flintrock tags or security groups, and even if cluster_instances isn't defined. That way, even if create_instances() is interrupted, we can find the instances that were left behind by using the ClientToken.
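
A rough sketch of how the ClientToken idea could work with boto3 (this is not Flintrock's code; the AMI and instance type values are placeholders):

```python
import uuid

import boto3

ec2 = boto3.resource('ec2')

# Generate the idempotency token up front so the instances can be found
# later even if create_instances() never gets a chance to return.
client_token = str(uuid.uuid4())

try:
    instances = ec2.create_instances(
        ImageId='ami-12345678',   # placeholder
        InstanceType='m4.large',  # placeholder
        MinCount=1,
        MaxCount=1,
        ClientToken=client_token,
    )
except (Exception, KeyboardInterrupt):
    # The call was interrupted, but EC2 may still have launched instances.
    # Look them up by the token we chose and clean them up.
    leftover = ec2.instances.filter(
        Filters=[{'Name': 'client-token', 'Values': [client_token]}],
    )
    for instance in leftover:
        instance.terminate()
    raise
```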

What do you think?

douglaz commented 7 years ago

The ClientToken approach is interesting but it doesn't help in my use case, I think.

My use case is the following: I have a "supervisor" script that, when it starts, tries to find an existing cluster in good shape. If it doesn't find one, it preemptively destroys all machines that may have been created in a previous execution and then provisions a new cluster from scratch (and later runs the configured jobs). Today it does the destroy by terminating any machines in the security group named after the cluster (I don't put non-cluster machines in that security group, so I can safely do that).
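
For illustration, that "terminate everything in the cluster's security group" step can be done with something like the following boto3 snippet (the group name here is an assumption; adjust it to however the cluster's group is actually named):

```python
import boto3

ec2 = boto3.resource('ec2')

# Assumed naming convention for the cluster's security group.
cluster_security_group = 'flintrock-my-cluster'

# Terminate every instance attached to that group. This is only safe
# because nothing other than cluster machines ever joins this group.
doomed = ec2.instances.filter(
    Filters=[{'Name': 'instance.group-name', 'Values': [cluster_security_group]}],
)
for instance in doomed:
    instance.terminate()
```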

This script is itself supervised, and if anything happens it will be restarted until it finishes what needs to be done. The problem with the ClientToken approach is that someone or something could hard-interrupt this script while it is launching a cluster (e.g. kill -9, Linux's OOM killer, a machine crash, etc.). When the script comes back, it won't be able to kill the machines left by the previous execution unless we have some deterministic/safe way to derive the ClientToken used in that execution. (The script itself is stateless; it doesn't know what it did previously.)

Since AWS says we shouldn't reuse a client token, and saving/retrieving a randomly generated token is tricky and complex, I can't see how we could gracefully handle such a situation.

nchammas commented 7 years ago

> The problem with the ClientToken approach is that someone or something could hard-interrupt this script while it is launching a cluster (e.g. kill -9, Linux's OOM killer, a machine crash, etc.).

How often does this happen for you? The situation I was targeting was when the user kills Flintrock with a simple KeyboardInterrupt before create_instances() can return. In that case, we would be able to clean up before exiting.

If we want to handle hard exits then I agree the ClientToken won't help much. In that case, hmm... perhaps your approach is what we have to do. I'm a little afraid of it though. 😓

How about we display an additional warning from inside _get_cluster_master_slaves() when Flintrock comes across a broken cluster and then return all the instances as slaves, as you are suggesting? Would that make cleaning up after an interrupted launch a bit better?
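
Something along these lines, perhaps (a sketch only; the tag name and the exact structure of Flintrock's _get_cluster_master_slaves() are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


def _get_cluster_master_slaves(instances):
    """Sketch of the proposed fallback: warn about untagged instances and
    treat them as slaves so describe/destroy keep working."""
    master_instance = None
    slave_instances = []
    for instance in instances:
        if not instance.tags:
            # An interrupted launch can leave instances with no tags at all.
            logger.warning(
                "Instance %s has no tags; treating it as a slave. "
                "This cluster may have been left behind by an interrupted launch.",
                instance.id)
            slave_instances.append(instance)
            continue
        role = None
        for tag in instance.tags:
            if tag['Key'] == 'flintrock-role':  # tag name is an assumption
                role = tag['Value']
        if role == 'master':
            master_instance = instance
        else:
            slave_instances.append(instance)
    return (master_instance, slave_instances)
```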

douglaz commented 7 years ago

I think the ClientToken approach is good for the soft kill. For the hard kill, I see no harm in warning the user, although I will always call it with --assume-yes.

douglaz commented 7 years ago

> How often does this happen for you?

Once is enough. It's no fun to manually kill dozens of machines.

nchammas commented 7 years ago

OK, understood and accepted.

Regarding my comment:

> How about we display an additional warning from inside _get_cluster_master_slaves() when Flintrock comes across a broken cluster and then return all the instances as slaves, as you are suggesting?

Would that solve both the soft-kill and hard-kill problems for you? As long as flintrock describe and flintrock destroy work on these kinds of borked clusters, the user can clean up without manually killing instances.

If this works for you, do you want to implement it in this PR? No worries if not. I want to settle on the approach at the very least.

douglaz commented 7 years ago

@nchammas, I agree with that approach. I can't promise to tackle it right now, though. Perhaps we could use a different PR for that?