nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
636 stars 116 forks source link

Sometimes describe gets the cluster name wrong #318

Closed gnsiva closed 9 months ago

gnsiva commented 3 years ago

Thanks for all your work on this, it has been an invaluable tool for my team. I have a slightly intermittent issue where flintrock describe gets the cluster name wrong. I've tracked down where it is happening in the code but don't necessarily have a good approach for how to fix it.

I create a cluster:

flintrock --config <config> launch \
    --ec2-identity-file <> \
    --ec2-key-name <> \
    --spark-version 2.4.5 \
    --hdfs-version 2.7.6 \
    "spark-test-dag" 
    --ec2-security-group "flintrock-system-security-group" 
    --num-slaves 1 \
    --ec2-instance-type c5.large \
    --ec2-spot-price 0.8 \
    --assume-yes

The command outputs the following at the end, it got the cluster name correctly from the command.

Login with: flintrock login spark-test-dag

Then running describe I get the following:

Found 1 cluster in region cn-north-1.
---
system-security-group:
  state: running
  node-count: 2
  master: <>
  slaves:
    - <>

So it has taken the cluster name from the security group I specified, minus the flintrock- at the beginning. Looking in the ec2 console, the security groups are listed as;

flintrock, flintrock-system-security-group, flintrock-spark-test-dag

And in the describe code (ec2.py line 1060), it gets the cluster name by taking the first security group that starts with flintrock-

    for group in instance.security_groups:
        if group['GroupName'].startswith('flintrock-'):
            return group['GroupName'].replace('flintrock-', '', 1)

So it seems like it has added the security groups out of order and that has caused the problem.

For now I will rename that security group to system-security-group-flintrock :smile: , but thought it would be good to document this.

nchammas commented 3 years ago

Hello @gnsiva. I'm glad Flintrock is useful to you and your team!

Thank you for this clear bug report and breakdown of where the issue is coming from. I guess you're pointing to this code: https://github.com/nchammas/flintrock/blob/227d04b361c5af9e96b97266654d711f3d50fdc3/flintrock%2Fec2.py#L1033-L1035

It looks like a classic issue with using a user-modifiable source of information (the list of associated security groups) to identify the cluster name. Instead, we should somehow capture the information in a way that cannot be confused by user input.

I don't have a fix to propose off the top of my head, but this is definitely an issue.

gnsiva commented 3 years ago

Hi @nchammas, thanks for your quick reply. Just to check I was clear, the cluster and ec2 instances was not modified apart from by flintrock. Perhaps using a tag containing the cluster name could work, it would still be possible for a user to edit it, but it should solve the race condition problem that there appears to be.

nchammas commented 3 years ago

I forget why I didn't or couldn't use tags for this purpose. I remember there was some sort of problem with them, but I'll have to check my notes or dig through the history to remember what it was.

Yes, you were clear. What I am pointing out is that the names of the security groups provided via --ec2-security-group are user-specified, so there is user-controlled input in the names that get parsed by _get_cluster_name(), hence the problem you are having. The code assumes that only Flintrock will create a security group named starting with flintrock-, which is certainly not true.

gnsiva commented 3 years ago

Ah OK, if tags don't work, I can't think of a better idea either. Cheers!

nchammas commented 9 months ago

@gnsiva - It's about 3.5 years too late, but I believe #362 fixes this issue. We now use tags instead of security groups to name and identify clusters.

It turns out that the reason we weren't using tags to begin with was because EC2 didn't let you specify tags on launch -- tagging had to be a separate operation. But that's no longer the case, and now we tag instances as part of the launch.

Related discussion in #339.