nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Option for Minimum EBS Root Volume Size #174

Closed PiercingDan closed 7 years ago

PiercingDan commented 7 years ago

There should be an option in the Flintrock configuration file to override min_root_device_size_gb = 30 (line 626 of ec2.py) with any desired value. 30 GB may be excessive and costly in some cases, given that the AMI is smaller than 30 GB (10 GB in my case).

Edit: I address this also in my guide.
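For illustration, here is a minimal sketch of how such an option could interact with the AMI's root volume size. This is not Flintrock's actual code; the function name, signature, and default are hypothetical, and only the 30 GB figure comes from this thread.

```python
# Hypothetical sketch (not Flintrock's actual ec2.py logic): make the
# hardcoded 30 GB minimum root volume size overridable.
DEFAULT_MIN_ROOT_DEVICE_SIZE_GB = 30

def resolve_root_volume_size_gb(ami_root_size_gb, min_root_device_size_gb=None):
    """Pick the EBS root volume size for an instance.

    The root volume can never be smaller than the AMI's snapshot, so take
    the larger of the AMI size and the configured minimum.
    """
    minimum = min_root_device_size_gb or DEFAULT_MIN_ROOT_DEVICE_SIZE_GB
    return max(ami_root_size_gb, minimum)

# With a 10 GB AMI and the default minimum, the root volume is still 30 GB;
# with an explicit minimum of 10 GB, it would be 10 GB instead.
print(resolve_root_volume_size_gb(10))      # 30
print(resolve_root_volume_size_gb(10, 10))  # 10
```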

nchammas commented 7 years ago

If I'm remembering my Flintrock history correctly, I believe I set the default size to 30 GB because 10 GB is not enough to build Spark from source, which is one of the features that Flintrock supports. The initial 10 GB default was also reported as too small by several early users of Flintrock. I set this new default in #50.

What's the additional cost when going from 10 GB to 30 GB for the root volume if, say, you have a 100-node cluster? I remember it being minuscule, but I don't have a hard calculation documenting it.

I'm inclined to leave this default as-is without an option to change it, since every new option complicates things a bit. But if the added cost is significant I would be open to reconsidering, since I know one of the reasons people use Flintrock over, say, EMR is to cut costs.

PiercingDan commented 7 years ago

EDIT: Below has been modified

From my guide (based on https://aws.amazon.com/ebs/pricing/)

The price for Amazon EBS gp2 volumes is $0.10 per GB-month for US East. Since Flintrock sets its default minimum EBS root volume size to 30 GB, the EBS volume costs about $0.10/day per instance, or about $0.004/hour per instance, regardless of the instance type or AMI, whereas spot-requested m3.medium instances cost about $0.01/hour per instance.

The price is comparable to the instance cost.
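As a quick sanity check of these figures (assuming the $0.10 per GB-month gp2 rate for US East quoted above and a 30-day month), including the 100-node cluster case asked about earlier:

```python
# Back-of-the-envelope EBS gp2 cost for a 30 GB root volume per instance.
gb_per_instance = 30
price_per_gb_month = 0.10  # USD, US East gp2

monthly = gb_per_instance * price_per_gb_month  # $3.00 per instance-month
daily = monthly / 30                            # $0.10 per instance-day
hourly = daily / 24                             # ~$0.004 per instance-hour

print(f"${monthly:.2f}/month, ${daily:.2f}/day, ${hourly:.4f}/hour per instance")
print(f"100-node cluster: ~${hourly * 100:.2f}/hour for root volumes")
```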

pragnesh commented 7 years ago

I find the 30 GB EBS volume small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

PiercingDan commented 7 years ago

You could do one of the following:

pragnesh commented 7 years ago

@PiercingDan EBS gp2 volume pricing is $0.10 per GB-month, so it only costs $3 per month for 30 GB, and the hourly cost, 3/(24*30) ≈ $0.004, is less than the instance cost of $0.01/hour.

PiercingDan commented 7 years ago

Good catch @pragnesh

nchammas commented 7 years ago

@pragnesh:

I find the 30 GB EBS volume small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

Flintrock deploys HDFS to the EBS root volume only if the AMI has no ephemeral volumes attached. If you select a larger instance type that has ephemeral volumes (also called instance store volumes) Flintrock will use those instead for HDFS. That's because they are super fast (faster than EBS), and Flintrock users (from my understanding) typically use HDFS in conjunction with Spark to share things like shuffle files or temporarily stage data before starting their job. The permanent store for these users is typically something like S3. I strongly recommend against using Flintrock-managed HDFS as anything other than a temporary store for your data.

This should probably be documented explicitly somewhere. I don't believe it currently is.
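As an aside, if you want to check whether a given instance type comes with instance store (ephemeral) volumes before launching, a query along these lines can tell you. This is a sketch assuming boto3's EC2 `describe_instance_types` call (which requires a reasonably recent botocore); the instance types and region are just examples, not anything this thread prescribes.

```python
import boto3

# Check whether an instance type has instance store (ephemeral) volumes,
# which Flintrock would use for HDFS instead of the EBS root volume.
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(InstanceTypes=["m4.large", "m5d.large"])

for itype in resp["InstanceTypes"]:
    name = itype["InstanceType"]
    if itype.get("InstanceStorageSupported"):
        size_gb = itype["InstanceStorageInfo"]["TotalSizeInGB"]
        print(f"{name}: {size_gb} GB of instance store (HDFS would go here)")
    else:
        print(f"{name}: EBS-only (HDFS would land on the root volume)")
```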

pragnesh commented 7 years ago

@nchammas We use HDFS only as a temporary store. I know we can use instances with ephemeral volumes if we need more HDFS storage. But spot prices for instances with instance store volumes are usually higher and also change frequently, so to avoid losing instances, we tend to use instance types like m4.large. For us, EBS performance for HDFS is not a big issue compared to losing an instance during a running job; we can work around this by adding more instances. It would just be nice to have such a setting in the launch config.