launchflow / buildflow

BuildFlow is an open source framework for building large-scale systems using Python. All you need to do is describe where your input is coming from and where your output should be written, and BuildFlow handles the rest. No configuration outside of the code is required.
https://docs.launchflow.com/buildflow
Apache License 2.0

Resource allocation + GPU allocation #266

Open uriafranko opened 1 year ago

uriafranko commented 1 year ago

Hey guys, love your work :)

I'm trying to figure out how I can allocate the relevant resources for each flow.

For example: I would like to allocate 2 CPUs and 8 GB of memory to one flow, and 4 CPUs, 16 GB of memory, and 1 GPU to another.

In both Pulumi and Ray clusters you can specify those for each worker. Any idea how we can modify BuildFlow to support GPUs and custom resource allocation?

JoshTanke commented 1 year ago

Hey Uria, I appreciate the kind words! 🙂

We allocate resources at the processor level - for example, you can set num_cpus per replica inside the pipeline decorator like: @app.pipeline(..., num_cpus=2). Here are docs on this

It sounds like you have 2 separate workflows you want to set up? Flows are essentially just a container type in our framework, so you could model them either as 2 flows, or as 2 processors in the same flow. Combining them into the same flow would let them share compute resources, which is usually more cost effective (they will still autoscale independently of each other).
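To make the two-processors-in-one-flow idea concrete, here is a minimal self-contained sketch. The `Flow` / `pipeline` names below are hypothetical stand-ins that only record the per-processor resource kwargs - the real BuildFlow decorator also wires up sources and sinks, which this stub omits:

```python
# Hypothetical stand-in for BuildFlow's Flow/@app.pipeline API, kept
# only to show the shape of per-processor resource requests. The real
# decorator also takes source/sink arguments (see the BuildFlow docs).

class Flow:
    def __init__(self):
        # Maps processor name -> requested resources per replica.
        self.processors = {}

    def pipeline(self, **resources):
        def wrap(fn):
            self.processors[fn.__name__] = resources
            return fn
        return wrap

app = Flow()

# Two processors sharing one flow, each with its own resource request.
@app.pipeline(num_cpus=2)
def light_processor(element):
    return element

@app.pipeline(num_cpus=4)
def heavy_processor(element):
    return element

print(app.processors)
# {'light_processor': {'num_cpus': 2}, 'heavy_processor': {'num_cpus': 4}}
```

Each processor keeps its own resource request even though they live in the same flow, which matches the point above: shared cluster, independent scaling.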

I don't think we have exposed the memory and GPU options yet, but that should be trivial to add. I can get a PR together this weekend!

Aside: I'm happy to hop on a call this next week if you'd like to dive into any more specifics! Would love to hear about your use case to see how we can best support it 🙂

JoshTanke commented 1 year ago

https://github.com/launchflow/buildflow/pull/267 exposes the ray options. I'll need to sync with Caleb about adding support in the autoscaler before I land it, but you can install the temporary change by pointing pip at the expose-gpu-and-memory branch: pip install git+https://github.com/launchflow/buildflow.git@expose-gpu-and-memory

One thing to watch out for: Ray's memory option is just for its own internal scheduler and will not enforce that your processor only uses X amount of memory (ray docs on this). If you start hitting OOM errors with this option set, you're most likely using more memory than you told the ray scheduler you would (we hit this a bunch early on).

I would personally recommend not setting the memory option if you can avoid it - errors in the Ray scheduler can be really hard to diagnose. Once it comes time to deploy, you can control memory usage by changing the machine type you use for each worker in your Ray cluster, and let the scheduler keep track of memory usage for you.

If you're running locally, you can start your Ray cluster with specific resource limits: ray start --head --num-gpus=3