skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 514 forks

[Jobs] Parallel execution for DAG #4055

Open cblmemo opened 1 month ago

cblmemo commented 1 month ago

Blocked by #4054.

As a first step, we should support parallel execution for basic DAGs in our jobs controller. For the following example, we should support parallel execution of the two finetune tasks.

image
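
For illustration only, here is a minimal sketch of what "parallel execution of independent tasks" could look like. It is not SkyPilot's jobs controller code: the task names, the `deps` mapping, and the helpers `topological_levels` and `run_dag` are all hypothetical, and the DAG shape is only assumed to resemble the example (two finetune tasks that share a parent). It groups tasks into levels with Kahn's algorithm and runs each level concurrently with the standard library's `ThreadPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


def topological_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into levels; tasks within a level are independent.

    `deps` maps each task name to the set of tasks it depends on.
    Every parent must also appear as a key in `deps`.
    """
    remaining = {task: set(parents) for task, parents in deps.items()}
    levels: List[List[str]] = []
    while remaining:
        # A task is ready once all of its parents have been scheduled.
        ready = sorted(t for t, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError('Cycle or unknown parent in DAG')
        levels.append(ready)
        for task in ready:
            del remaining[task]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels


def run_dag(deps: Dict[str, Set[str]],
            run_task: Callable[[str], None]) -> None:
    """Run the DAG level by level, with each level's tasks in parallel."""
    for level in topological_levels(deps):
        with ThreadPoolExecutor(max_workers=len(level)) as pool:
            # list() consumes the results so worker exceptions propagate.
            list(pool.map(run_task, level))


# Hypothetical DAG: two finetune tasks that only depend on 'preprocess'.
deps = {
    'preprocess': set(),
    'finetune_a': {'preprocess'},
    'finetune_b': {'preprocess'},
    'evaluate': {'finetune_a', 'finetune_b'},
}
levels = topological_levels(deps)
```

Running level by level is the simplest option but waits for the slowest task in each level; a finer-grained scheduler would instead start each task as soon as its own parents finish.
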
cblmemo commented 1 month ago

Assigning @andylizf

andylizf commented 1 month ago

@Michaelvll A few questions about self.dag in StrategyExecutor:

  1. Is self.dag intended as a future-proof design? If so, what scenarios were considered?

  2. Is it correct to assume that self.dag is unrelated to parallel execution of independent tasks at the same level?

  3. Or is it simply for convenience in passing arguments to self.launch, with no special significance?

https://github.com/skypilot-org/skypilot/blob/7971aa25fb6a5ffc45464be62d1af64fc3f46527/sky/jobs/recovery_strategy.py#L67-L80

Understanding this would help us implement parallel execution effectively. Thanks!