Two successive non-preemptible AdaptDL jobs received the same allocation from the single-job allocation optimization introduced by #66. Because the jobs are non-preemptible, the duplicate allocation later crashes the Pollux optimizer when a full allocation cycle starts:
```
    await self._optimize_all()
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/allocator.py", line 126, in _optimize_all
    allocations = self._allocate(jobs, nodes, prev_allocations,
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/allocator.py", line 259, in _allocate
    allocations, desired_nodes = self._policy.optimize(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 191, in optimize
    problem = Problem(list(jobs.values()), list(nodes.values()) +
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 275, in __init__
    self._max_replicas[j, n] = min(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 276, in <genexpr>
    self._get_avail_resource(
  File "/usr/local/lib/python3.8/site-packages/adaptdl_sched/policy/pollux.py", line 299, in _get_avail_resource
    assert resource >= 0
AssertionError
```
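The assertion trips because available capacity is computed by subtracting every job's usage from a node's total; when two jobs hold the same allocation, the same node is charged twice and the remainder goes negative. A minimal sketch of that arithmetic (the variable names below are illustrative, not the actual pollux.py code):

```python
# Hypothetical numbers: one node with 4 GPUs, and two jobs that were
# both (incorrectly) given all 4 GPUs on that node.
node_gpus = 4
allocations = {"job-a": ["node-1"] * 4, "job-b": ["node-1"] * 4}

avail = node_gpus
for job, nodes in allocations.items():
    # Subtract each job's replicas placed on node-1 from its capacity.
    avail -= nodes.count("node-1")

print(avail)  # -4: exactly the value that trips `assert resource >= 0`
```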
The issue happens because a `None` pod label qualifier is used where an empty string `""` was intended to include all pods. Since the `None` qualifier returns no pods, the code always assumes all nodes are fully available and assigns the first node to every incoming job. The bug also affects preemptible jobs, but with no visible impact, because their allocation gets fixed by a later full allocation cycle.
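The two values are easy to conflate, since both look like "no label filter". Below is a minimal sketch of the failure mode under that assumption; `pods_matching_label` is a hypothetical helper, not the actual adaptdl_sched code:

```python
def pods_matching_label(pods, key, value):
    # Hypothetical pod filter: an empty string "" acts as a wildcard
    # that matches every pod, while None never equals any label value,
    # so it silently matches nothing.
    if value == "":
        return list(pods)
    return [p for p in pods if p.get("labels", {}).get(key) == value]

pods = [{"name": "job-a-0", "labels": {"adaptdl/job": "job-a"}}]

# Buggy query: None returns no pods, so the allocator believes every
# node is empty and hands the first node to each new job.
assert pods_matching_label(pods, "adaptdl/job", None) == []

# Fixed query: "" returns all pods, so existing placements are counted.
assert len(pods_matching_label(pods, "adaptdl/job", "")) == 1
```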
With the fix, every new job gets the next available node.
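Once existing pods are counted, successive jobs naturally spread across nodes. A rough sketch of the intended behavior (a first-fit toy model, not the actual optimizer, which solves a full placement problem):

```python
capacity = {"node-1": 4, "node-2": 4}
used = {"node-1": 0, "node-2": 0}

def next_available_node(replicas):
    # Return the first node with enough free slots for the new job.
    for node, total in capacity.items():
        if total - used[node] >= replicas:
            used[node] += replicas
            return node
    return None  # no capacity left; the job would have to wait

print(next_available_node(4))  # node-1
print(next_available_node(4))  # node-2, not node-1 again
```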