nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Add --ec2-master-instance-type option to support a different master instance type #166

Closed. pragnesh closed this 7 years ago.

pragnesh commented 7 years ago

This PR makes the following changes:

I tested this PR by launching clusters with different master instance types, and by using add-slaves to add slaves to an existing cluster on EC2. It is working fine.

I made this change because running a Spark cluster with a different master instance type can save money and improve resource utilisation.

nchammas commented 7 years ago

Hi @pragnesh and thank you for this PR. People have asked for this feature in the past, so I know there is some interest in it. 👍

When growing an existing cluster, Flintrock makes sure to read all possible configurations from the existing nodes so that newly added nodes are automatically configured the same way. This means that the slave instance type should automatically be selected based on the type of the existing slaves.
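For illustration, here is a minimal sketch of how that kind of inference could look with boto3. The helper name and the tag names are hypothetical and do not necessarily match what Flintrock actually applies to its instances:

```python
import boto3


def infer_slave_instance_type(cluster_name: str, region: str) -> str:
    """Look up the instance type of an existing slave in the cluster.

    Hypothetical helper: the tag names below are illustrative only.
    """
    ec2 = boto3.client('ec2', region_name=region)
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:flintrock-cluster', 'Values': [cluster_name]},
            {'Name': 'tag:flintrock-role', 'Values': ['slave']},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
    )
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            # All existing slaves share a type, so the first one is enough.
            return instance['InstanceType']
    raise RuntimeError(f"No running slaves found for cluster {cluster_name!r}.")
```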

The only case where the user should be able to specify the slave instance type when adding nodes to an existing cluster is when that cluster has just a master and no slaves.

That said, I have to wonder out loud if it's better to just leave this feature out of Flintrock. The benefit is clear but marginal, and it adds a little bit of complexity. Since Flintrock is hard to test (the valuable tests -- i.e. the acceptance tests -- are slow and cost money), every little bit of added complexity will slowly make it harder to maintain the project over time. If it were practical to implement automated tests to cover all the possible use cases, I wouldn't be as concerned.

What are your thoughts on that?

I'm also curious if @serialx, @BenFradet, or anyone else watching the repo has any thoughts on the utility of this feature weighed against the long-term maintenance burden as the supported use cases inevitably grow over time.

pragnesh commented 7 years ago

I understand your concern. It is ok if you don't merge this feature.

serialx commented 7 years ago

Users with small clusters would be sensitive to the master's instance type, since the master accounts for a larger share of the total cluster cost when there are fewer slaves. Also, for performance tests and experimentation -- which is Flintrock's main use case -- it would be tempting to use a smaller master instance type.

I also understand the long-term maintenance burden. Looking at the patch, wouldn't the add_slaves ec2_instance_type parameter be avoidable if we saved the ec2_instance_type variable in the cluster metadata?

nchammas commented 7 years ago

> I also understand the long-term maintenance burden. Looking at the patch, wouldn't the add_slaves ec2_instance_type parameter be avoidable if we saved the ec2_instance_type variable in the cluster metadata?

Yes, that's true as long as we continue to disallow clusters from being created with 0 slaves. That would simplify the logic and supported use cases a bit. 👍

This comes up against another problem, though, which is that the manifest currently has no mechanism for persisting provider-specific settings like instance type. We'd need to add that in. It shouldn't be a big deal, but it may be beyond the scope of what @pragnesh was hoping to do in this PR.

I believe this was the same deficiency in our manifest implementation that @SoyeonBaek-dev hit up against. I forget what her use case was exactly, but she also wanted to be able to persist some EC2-specific things in the manifest. Is that correct @SoyeonBaek-dev? Do you remember what exactly it was?

dm-tran commented 7 years ago

Thanks for this PR @pragnesh 👍

This is a feature my team members and I are definitely interested in. Our daily Spark jobs are launched automatically using the Spark REST API which needs --deploy-mode cluster. With deploy mode "cluster", the driver is deployed on one of the slaves and the master is kind of "useless" (it doesn't need an instance as large as slaves). This would allow us to reduce costs.

nchammas commented 7 years ago

Thanks for chiming in @dm-tran and @serialx. This kind of feedback from the community is valuable and gives me the confidence to steer Flintrock in the right direction.

OK, I think given the feedback here we want to include a feature like this. To do it in a way that's consistent with Flintrock's design, we'll need to refine the manifest concept a bit to allow for provider-specific settings like EC2 instance type.

@pragnesh - Would you like to take that on? No worries if not. I will likely work on that next, as soon as I fix this one bug I'm working on at the moment.

pragnesh commented 7 years ago

Yes, I can try to work on this. Let me go through the manifest concept first.

nchammas commented 7 years ago

OK great. It's a little half-baked at the moment, unfortunately, but I can give guidance as needed. The idea is basically to add a section to the manifest that is dedicated to EC2-specific settings and then persist/load the EC2 slave instance-type to/from the manifest.
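As a rough sketch of that idea, here is one way a provider-specific section and the persist/load step might look. The manifest shape, keys, and helper functions below are assumptions for illustration, not Flintrock's actual manifest format:

```python
import json

# Hypothetical manifest with a dedicated EC2 section.
EXAMPLE_MANIFEST = {
    'version': '1',
    'services': ['spark', 'hdfs'],
    'ec2': {
        'slave_instance_type': 'r3.xlarge',
    },
}


def save_manifest(manifest: dict, path: str) -> None:
    """Persist the manifest, including the EC2-specific section."""
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)


def load_slave_instance_type(path: str) -> str:
    """Read the slave instance type back from the manifest's EC2 section."""
    with open(path) as f:
        manifest = json.load(f)
    return manifest['ec2']['slave_instance_type']
```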

pragnesh commented 7 years ago

While using my change, I found that having different instance types for the master and the slaves creates an issue with the HDFS cluster. HDFS fails to start if the disk configurations differ; I am not sure why.

pragnesh commented 7 years ago

The best way to avoid complexity while still using the master instance as a worker is to update the slave_hosts template parameter in core.py to include master_host:

'slave_hosts': '\n'.join([self.master_host] + self.slave_hosts),

I have tested this change and it works without any issue.
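For context, here is a simplified, hypothetical sketch of where that line would sit. The function name and surrounding structure are illustrative, not the exact code in core.py; only the 'slave_hosts' line reflects the proposed change:

```python
def build_template_parameters(master_host: str, slave_hosts: list) -> dict:
    """Illustrative stand-in for the template mapping built in core.py."""
    return {
        'master_host': master_host,
        # The proposed one-line change: include the master among the slave
        # hosts so Spark workers and HDFS datanodes also run on the master.
        'slave_hosts': '\n'.join([master_host] + slave_hosts),
    }


# Example:
print(build_template_parameters('10.0.0.1', ['10.0.0.2', '10.0.0.3']))
```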

nchammas commented 7 years ago

Thanks for reporting these additional details @pragnesh.

> HDFS fails to start if the disk configurations differ; I am not sure why.

This is probably because on larger instances HDFS will use instance store storage, whereas on smaller instances there will be no instance store storage so HDFS will use the EBS root drive. There's probably something going wrong there, especially if you restart the cluster.

> The best way to avoid complexity while still using the master instance as a worker is to update the slave_hosts template parameter in core.py to include master_host

Are you saying that this fixes the issue with HDFS when the master and slave instance types are very different?

pragnesh commented 7 years ago

> Are you saying that this fixes the issue with HDFS when the master and slave instance types are very different?

No. I am describing another approach that uses the master instance instead of keeping it idle: just add the master host to the list of slave hosts, so the Spark and HDFS clusters run on the master as well as all the slaves. It's the one-line change I mentioned in my previous comment, it works without issue, and it doesn't add much complexity. There is no need to support different master and slave instance types to save cost.

nchammas commented 7 years ago

Ah, I see. That makes sense.

@serialx / @dm-tran - What do you think?

Having the master be one of the slaves would take care of the problems related to mismatched HDFS storage caused by different instance sizes.

dm-tran commented 7 years ago

This solution looks elegant and simple.

It's fine for Spark slaves. For HDFS, it means that the namenode and one datanode will be deployed on the master: I'm a bit worried that it might overload the master (usually, the namenode is deployed on the master and datanodes are deployed on slaves).

pragnesh commented 7 years ago

If we are worried about overloading the master, we can add a flag for this and enable the behaviour only if it is set to true.

serialx commented 7 years ago

A flag would be good. I think we should experiment with having the master be one of the slaves. Let's add it as an experimental flag.

nchammas commented 7 years ago

Per the conversation above, it sounds like we want a new PR that implements a different solution where the master is flagged to simply be one of the slaves. This saves money -- which is the original impetus for this PR -- while not adding much complexity. Perhaps we can call the new option --colocate-master/--no-colocate-master?
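For illustration, a minimal standalone sketch of how such a boolean flag could be declared with click. The command shown and the behaviour wired to the flag are assumptions for this sketch, not Flintrock's actual CLI code:

```python
import click


@click.command()
@click.option(
    '--colocate-master/--no-colocate-master',
    default=False,
    help="Experimental: also run Spark workers and HDFS datanodes "
         "on the master instance.",
)
def launch(colocate_master: bool):
    """Hypothetical launch command showing only the new flag."""
    if colocate_master:
        click.echo("Master will also be used as a worker.")
    else:
        click.echo("Master will not run worker processes.")


if __name__ == '__main__':
    launch()
```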

@pragnesh - I'm closing this PR, but feel free to open a new one along the lines discussed here.