grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

Support C5d and M5d instance types #40

Open josh-newman opened 6 years ago

josh-newman commented 6 years ago

The new local storage instance types seem like good fits for reflow workers.

mariusae commented 6 years ago

A good strategy is probably to let ec2cluster choose the storage strategy depending on the instance type: if there is sufficient local storage available, use it instead of EBS.
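As an illustration of the rule described above, here is a minimal Go sketch. It is not ec2cluster code: `StorageKind`, `chooseStorage`, and its parameters are hypothetical names, and the sketch assumes the cluster knows each instance type's local disk capacity and the requested scratch size.

```go
package main

import "fmt"

// StorageKind names where scratch data would live (hypothetical type).
type StorageKind string

const (
	InstanceStore StorageKind = "instance-store"
	EBS           StorageKind = "ebs"
)

// chooseStorage is a hypothetical helper, not part of ec2cluster.
// It picks instance storage when the instance's local disks can hold
// the requested scratch space, and falls back to EBS otherwise.
func chooseStorage(localGB, requiredGB uint64) StorageKind {
	if localGB > 0 && localGB >= requiredGB {
		return InstanceStore
	}
	return EBS
}

func main() {
	// 75 GB is the local storage size quoted in this thread for m5d.large.
	fmt.Println(chooseStorage(75, 50))  // request fits on local disk
	fmt.Println(chooseStorage(75, 200)) // request exceeds local disk
	fmt.Println(chooseStorage(0, 50))   // instance type without local disk
}
```

The decision is all-or-nothing per instance: there is no spillover from local disk to EBS in this sketch.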

josh-newman commented 6 years ago

Would that mean all execs in a run need to fit under the instance storage max for any of them to use instance storage? (Since there's a single disk space configuration per cluster, if I understand correctly.) It might be nice to let some execs use instance storage while others spill over to EBS. Maybe software RAID could help do this spillover seamlessly? (If it is aware of SSD vs. HDD differences, maybe it can handle this too?)

Those sound more complicated, though, so something simple would be great to start with.

mariusae commented 6 years ago

I think that's too complicated. The simplest thing would be to do just one or the other. If you require more storage than is available on instance storage, tough luck...

swami-m commented 5 years ago

Having looked through the instance types with storage, I doubt that there's much value in pursuing this. Many of these instance types with storage have much less CPU/RAM compared to others at a similar or lower cost. So we are probably better off using a cheaper but beefier or equivalent instance type with attached EBS volumes.

We are considering supporting dynamic resizing of EBS volumes in a reflowlet instance, at which point we can start with a conservatively sized EBS volume, thereby reducing cost further.

It's probably worth exploring for bigmachine, particularly for, say, machine-learning-type use cases where we do repeated I/O over the same data.

josh-newman commented 5 years ago

@swami-m, which instance types are you looking at? M5, C5, and R5 all have M5d, C5d, and R5d. The price differential doesn't seem huge, for example $0.096/hr for M5 vs. $0.113/hr for M5d, which have the same CPU and memory. The ~20% cost increase for local disk could be worthwhile for some workloads, right?

I totally understand if this is low priority / not worth the complexity, though.

swami-m commented 5 years ago

Yeah, but an M5d.large comes with 75GB of local storage at 20% higher cost compared to an M5.large. That's just not enough instance storage for it to be worthwhile.

Since reflow's instance-type-choosing logic is primarily driven by price (for the CPU/mem requirements), and since instance types with storage are always more expensive, we are not likely to choose these types (unless the cheaper ones are not available).

And purely cost-wise, EBS volumes are cheaper than instance storage (in the above example, 75GB of EBS costs $0.01/hr). The performance boost of using instance storage is probably worth it for repetitive I/O over the same data, which doesn't happen in the majority of reflow use cases.
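The arithmetic behind this comparison can be checked with a short Go sketch. It uses only the figures quoted in this thread ($0.096/hr for M5, $0.113/hr for M5d, $0.01/hr for 75GB of EBS); the constant and function names are made up for illustration, and current AWS prices may differ.

```go
package main

import "fmt"

// Hourly prices as quoted in this thread; they are not fetched from AWS.
const (
	m5LargeHourly  = 0.096
	m5dLargeHourly = 0.113
	ebs75GBHourly  = 0.01
)

// localDiskPremium returns the extra hourly cost of the variant that
// includes instance storage over the base instance type.
func localDiskPremium(base, withDisk float64) float64 {
	return withDisk - base
}

func main() {
	premium := localDiskPremium(m5LargeHourly, m5dLargeHourly)
	fmt.Printf("local disk premium:   $%.3f/hr (%.1f%% over m5.large)\n",
		premium, 100*premium/m5LargeHourly)
	fmt.Printf("75GB of EBS:          $%.3f/hr\n", ebs75GBHourly)
	fmt.Printf("m5.large + 75GB EBS:  $%.3f/hr\n", m5LargeHourly+ebs75GBHourly)
	fmt.Printf("m5d.large:            $%.3f/hr\n", m5dLargeHourly)
}
```

On these numbers the 75GB of local disk costs about $0.017/hr, against about $0.01/hr for the same capacity on EBS.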

All that being said, I agree that it could be worthwhile in cases where the user constrains reflow to use certain instance types, particularly those with instance storage. (In that case, since the user is already paying for it, might as well use it and not pay more for EBS)