[DAG] Integrate Data Storage Buckets for Data-Bearing Edges in Optimization

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

https://skypilot.readthedocs.io

Apache License 2.0

[DAG] Integrate Data Storage Buckets for Data-Bearing Edges in Optimization #4320

Status: Closed (euclidgame closed this 6 days ago)

euclidgame commented 1 week ago

This PR accounts for data storage when optimizing and executing workflows that transfer data between tasks.

Tested (run the relevant ones):

- [x] Code formatting: bash format.sh
- [x] Any manual or new tests for this PR (please specify below)
  - [x] A diamond-like workflow in which two tasks use the outputs of their upstream tasks as inputs (see examples/dag/diamond.yml)
- [ ] All smoke tests: pytest tests/test_smoke.py
- [ ] Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
- [ ] Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh
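
For readers unfamiliar with the shape, a diamond workflow with data-bearing edges can be modeled roughly as follows. This is a minimal standalone sketch, not the code in this PR: `Edge`, `DIAMOND_EDGES`, `topological_order`, and `total_transfer_gb` are hypothetical names, and the data sizes are made-up numbers.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A data-bearing edge: `src` writes `size_gb` of data that `dst` reads."""
    src: str
    dst: str
    size_gb: float


# Hypothetical diamond: A fans out to B and C, which both feed D.
DIAMOND_EDGES = [
    Edge("A", "B", 10.0),
    Edge("A", "C", 10.0),
    Edge("B", "D", 5.0),
    Edge("C", "D", 5.0),
]


def topological_order(edges):
    """Return tasks so that every producer precedes its consumers."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = []
    for e in edges:
        for n in (e.src, e.dst):
            if n not in nodes:
                nodes.append(n)
        children[e.src].append(e.dst)
        indegree[e.dst] += 1
    # Start from tasks that have no upstream producer.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order


def total_transfer_gb(edges):
    """Total data that would pass through intermediate storage."""
    return sum(e.size_gb for e in edges)
```

In this toy model, each edge would correspond to one intermediate bucket that the upstream task writes and the downstream task reads; how the optimizer in this PR actually prices those transfers is not shown here.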

andylizf commented 1 week ago

@cblmemo PTAL, thanks!

andylizf commented 1 week ago

It seems some additional handling is required when generating bucket names, for example for a DAG like this:

import sky
from sky import Resources
from sky.optimizer import OptimizeTarget

# Two-task chain with a single data-bearing edge between the tasks.
with sky.Dag() as dag:
    task1 = sky.Task(name="task1", run="echo 'Hello, world!'")
    task1.set_resources(Resources(cpus=8))

    task2 = sky.Task(name="task2", run="echo 'Hello, world!'")
    task2.set_resources(Resources(cpus=8))

    # Mark the task1 -> task2 edge as data-bearing.
    (task1 >> task2).with_data('/tmp/data', '/tmp/data', 30)

# Optimize the DAG for time.
sky.optimize(dag, OptimizeTarget.TIME)
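
One possible shape for that handling, as a hedged sketch only: derive a deterministic name per edge and sanitize it. The sketch assumes S3-style naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit). `make_bucket_name` and the `sky-dag` prefix are hypothetical and are not part of SkyPilot or this PR.

```python
import hashlib
import re


def make_bucket_name(src_task: str, dst_task: str, prefix: str = "sky-dag") -> str:
    """Build a deterministic, S3-style-safe bucket name for one edge."""
    raw = f"{prefix}-{src_task}-{dst_task}".lower()
    # Replace every run of disallowed characters with a single hyphen.
    safe = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    # A short hash of the original names keeps distinct edges from
    # colliding after sanitizing or truncation.
    digest = hashlib.sha256(f"{src_task}\0{dst_task}".encode()).hexdigest()[:8]
    # Leave room for "-" plus the 8-character digest within 63 characters.
    safe = safe[:63 - 9].rstrip("-")
    if not safe:
        safe = "bucket"
    return f"{safe}-{digest}"
```

For the DAG above, `make_bucket_name("task1", "task2")` would yield a name starting with `sky-dag-task1-task2-` followed by an 8-character hash. Because S3 bucket names are globally unique, a real implementation would likely also need a per-user or per-run component.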

cblmemo commented 1 week ago

Please make sure this works for #4364. cc @andylizf

euclidgame commented 1 week ago

@cblmemo Please take a look