petuum / adaptdl

Resource-adaptive cluster scheduler for deep learning training.
https://adaptdl.readthedocs.io/
Apache License 2.0

Use empty string for all-inclusive pod-label-selector #113

Closed (odp closed 2 years ago)

odp commented 2 years ago

Two successive non-preemptible AdaptDL jobs received the same allocation from the single-job allocation optimization introduced by #66:

INFO:__main__:Patch AdaptDLJob  namespace/a-b-c-dzjr4: {'status': {'allocation': ['X.Y.0.232']}}
INFO:__main__:Patch AdaptDLJob  namespace/a-b-c-qczpw: {'status': {'allocation': ['X.Y.0.232']}}

Because the jobs are non-preemptible, the Pollux optimizer later crashes when a full allocation cycle starts, since two jobs hold the same allocation:

    await self._optimize_all()
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/allocator.py", line 126, in _optimize_all
    allocations = self._allocate(jobs, nodes, prev_allocations,
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/allocator.py", line 259, in _allocate
    allocations, desired_nodes = self._policy.optimize(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 191, in optimize
    problem = Problem(list(jobs.values()), list(nodes.values()) +
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 275, in __init__
    self._max_replicas[j, n] = min(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 276, in <genexpr>
    self._get_avail_resource(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 299, in _get_avail_resource
    assert resource >= 0
AssertionError
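
One way to picture why `assert resource >= 0` fires is the sketch below. It is not the actual `_get_avail_resource` code. The function name `avail_resource`, the one-GPU node capacity, and the one-GPU-per-job request are all assumptions made for illustration. The idea is only that capacity minus the requests of jobs pinned to a node goes negative once two pinned jobs share that node.

```python
def avail_resource(capacity, pinned_requests):
    """Hypothetical helper: capacity left on a node after subtracting
    the requests of jobs that are pinned (non-preemptible) to it."""
    return capacity - sum(pinned_requests)


node_gpus = 1  # assumed node capacity, for illustration only

# One pinned job that requests 1 GPU leaves nothing free, which is fine.
one_job = avail_resource(node_gpus, [1])

# Two pinned jobs on the same node over-commit it, so the result is negative
# and a check such as `assert resource >= 0` would fail.
two_jobs = avail_resource(node_gpus, [1, 1])

print(one_job, two_jobs)
```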

The issue happens because a None pod label selector, rather than an empty string "", is used to include all pods. None returns no pods, so the code always assumes that all nodes are available and assigns the first one to every incoming job. The bug also affects preemptible jobs, but without visible consequences, because a later full allocation cycle corrects the allocation.

With the fix, every new job gets the next available node.
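
The selector behavior described above can be pictured with a small self-contained simulation. It does not call the Kubernetes client or AdaptDL. `list_pods`, `free_nodes`, `first_free_node`, the pod and node names, and the `app` label are hypothetical stand-ins, and `list_pods` simply mimics what this description reports: a None selector yields no pods, while an empty string matches every pod.

```python
# Hypothetical cluster state: one running pod already occupies node-0.
PODS = [
    {"name": "job-1-pod", "node": "node-0", "labels": {"app": "training"}},
]
NODES = ["node-0", "node-1"]


def list_pods(label_selector):
    """Stand-in for a pod listing call, mimicking the behavior reported here."""
    if label_selector is None:
        return []  # assumption taken from this description: None returns no pods
    if label_selector == "":
        return list(PODS)  # empty selector is all-inclusive
    key, _, value = label_selector.partition("=")
    return [p for p in PODS if p["labels"].get(key) == value]


def free_nodes(label_selector):
    """Nodes that host none of the pods matched by the selector."""
    busy = {p["node"] for p in list_pods(label_selector)}
    return [n for n in NODES if n not in busy]


def first_free_node(label_selector):
    """Pick the first node that looks free, or None if there is none."""
    free = free_nodes(label_selector)
    return free[0] if free else None


buggy_choice = first_free_node(None)  # sees no pods, so node-0 looks free
fixed_choice = first_free_node("")    # sees the pod on node-0, moves on
print(buggy_choice, fixed_choice)
```

In this toy setup the None selector hands out the already occupied node again, while the empty string moves on to the next node, which matches the behavior reported before and after the fix.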