Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
Since it is impossible to determine the exact set of resource slots that all currently alive agents have unless we perform exhaustive scanning, we decided to accumulate and merge all possible resource slot types as they appear.
This allows a mixture of agents with different sets of accelerator plugins installed to coexist in a single Backend.AI cluster. Previously we had to install the same accelerator plugins even on nodes that do not have the corresponding accelerators.
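As a rough illustration of this accumulate-and-merge behavior, here is a minimal sketch; `known_slot_types` and `on_agent_heartbeat` are hypothetical names for illustration only, not the actual manager API:

```python
from typing import Dict

# Hypothetical registry of slot types the manager has seen so far.
known_slot_types: Dict[str, str] = {"cpu": "count", "mem": "bytes"}

def on_agent_heartbeat(reported_slot_types: Dict[str, str]) -> None:
    # Accumulate any newly appearing slot types instead of rejecting
    # agents whose accelerator plugin sets differ from other agents.
    for slot_name, slot_unit in reported_slot_types.items():
        known_slot_types.setdefault(slot_name, slot_unit)

# An agent with the CUDA plugin joins the cluster.
on_agent_heartbeat({"cpu": "count", "mem": "bytes", "cuda.device": "count"})
# An agent without any accelerator plugin can still coexist.
on_agent_heartbeat({"cpu": "count", "mem": "bytes"})
```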
Common:
Many comparison/calculation restrictions on the ResourceSlot type when the operands' keys differ are lifted. Undefined fields are now simply treated as zeros in each operand.
The common module and the manager still use the ai.backend.common.types.current_resource_slots context variable for sanity checks, but it is no longer as restrictive as in prior versions.
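A simplified sketch of the relaxed semantics (a stand-in class, not the actual ai.backend.common.types.ResourceSlot implementation), treating keys missing from either operand as zero:

```python
from decimal import Decimal

# Simplified stand-in for ResourceSlot, illustrating only the relaxed
# comparison/calculation semantics for mismatched key sets.
class SimpleResourceSlot(dict):

    def __add__(self, other):
        # Undefined fields in either operand are treated as zero.
        keys = set(self) | set(other)
        return SimpleResourceSlot(
            (k, self.get(k, Decimal(0)) + other.get(k, Decimal(0)))
            for k in keys
        )

    def __le__(self, other):
        # Comparison no longer requires both operands to have the
        # exact same set of keys.
        keys = set(self) | set(other)
        return all(
            self.get(k, Decimal(0)) <= other.get(k, Decimal(0))
            for k in keys
        )

requested = SimpleResourceSlot({"cpu": Decimal(2), "cuda.device": Decimal(1)})
capacity = SimpleResourceSlot({"cpu": Decimal(8), "mem": Decimal(16)})
print(requested <= capacity)  # False: cuda.device (undefined in capacity) is zero
```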
Manager:
Thanks to Python 3.7's contextvars, the resource slot types (ai.backend.common.types.current_resource_slots) now stay consistent throughout the handling of a single API request, preventing potential race conditions when resolving them.
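A rough illustration of the contextvars pattern; the real variable lives in ai.backend.common.types, and the request handler below is hypothetical:

```python
import contextvars

# The real variable is ai.backend.common.types.current_resource_slots;
# this only illustrates the per-request isolation contextvars provides.
current_resource_slots = contextvars.ContextVar("current_resource_slots")

async def handle_api_request(known_slot_types):
    # Each asyncio task handling an API request sees its own value,
    # so concurrent requests cannot race on the slot-type snapshot.
    token = current_resource_slots.set(known_slot_types)
    try:
        ...  # resolve/validate ResourceSlot values against the snapshot
    finally:
        current_resource_slots.reset(token)
```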
Agent:
The known resource slot types are determined when the agent starts and loads the accelerator plugins.
It now accepts any resource slots when creating a new kernel, but only proceeds to launch the kernel when all unknown slot values are zero. Otherwise, an UnsupportedResource exception is raised and passed back to the manager (see the sketch below).
When the agent is restarted and the accelerator plugins (or their configurations) are changed so that the agent's known slot types also change, the agent just keeps the existing containers running, as the manager takes care of their resource tracking.
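A sketch of the unknown-slot check, assuming a hypothetical UnsupportedResource exception class and function name (the agent's actual code differs):

```python
from decimal import Decimal

class UnsupportedResource(Exception):
    """Hypothetical stand-in for the agent's UnsupportedResource error."""

def check_requested_slots(requested, known_slot_types):
    # Accept any requested slot keys, but refuse to launch the kernel
    # if an unknown slot carries a non-zero value.
    for slot_name, value in requested.items():
        if slot_name not in known_slot_types and Decimal(value) != 0:
            raise UnsupportedResource(
                f"unknown slot {slot_name!r} requested with non-zero value {value}"
            )

# Unknown slots with zero values are tolerated:
check_requested_slots({"cpu": "2", "cuda.device": "0"}, {"cpu", "mem"})
# A non-zero unknown slot raises UnsupportedResource back to the manager:
# check_requested_slots({"cpu": "2", "cuda.device": "1"}, {"cpu", "mem"})
```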
NOTE: Executing new sessions with renamed slots (e.g., cuda.shares -> cuda.device) may conflict with existing containers, but we do not currently support such scenarios and address them via user guidelines.
Let's update our scheduling and statistics code to work with this change.