lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

ResourceSlot consistency when slot types disappear and appear at arbitrary times #84

Closed achimnol closed 4 years ago

achimnol commented 4 years ago

Since it is impossible to determine the exact set of resource slots that all currently alive agents have unless we perform exhaustive scanning, we decided to accumulate and merge all possible resource slot types as they appear.

This allows agents with different sets of accelerator plugins installed to coexist in a single Backend.AI cluster. Previously, we had to install the same accelerator plugins on every node, even on nodes that do not have the accelerators.
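The accumulate-and-merge idea can be sketched roughly as follows. This is only an illustration: `known_slot_types`, `merge_slot_types`, and the unit labels are hypothetical names, not the manager's actual code.

```python
# Hypothetical registry of every slot type seen so far: name -> unit label.
known_slot_types = {}


def merge_slot_types(reported):
    """Add slot types reported by one agent; never remove existing ones."""
    for name, unit in reported.items():
        known_slot_types.setdefault(name, unit)


# One agent without accelerators, then one with a CUDA plugin installed.
merge_slot_types({"cpu": "count", "mem": "bytes"})
merge_slot_types({"cpu": "count", "mem": "bytes", "cuda.shares": "count"})
print(sorted(known_slot_types))  # ['cpu', 'cuda.shares', 'mem']
```

Because entries are only ever added, the registry does not depend on knowing which agents are alive at any given moment.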


Common: Many of the comparison/calculation restrictions that applied to the ResourceSlot type when the operands' keys differ are lifted. Let's just treat undefined fields as zeros in each operand. The common module and the manager still use the ai.backend.common.types.current_resource_slots context variable for sanity checks, but it is no longer as restrictive as in prior versions.
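A minimal sketch of the "undefined fields are zeros" rule, assuming a dict-like slot container. `SlotMap` and `fits_in` are hypothetical stand-ins and not the real ResourceSlot API.

```python
from decimal import Decimal


class SlotMap(dict):
    """Hypothetical stand-in for ResourceSlot; not the real class."""

    def _pairs(self, other):
        # Walk the union of keys, reading missing slots as zero.
        zero = Decimal(0)
        for key in sorted(self.keys() | other.keys()):
            yield key, self.get(key, zero), other.get(key, zero)

    def __add__(self, other):
        return SlotMap({k: a + b for k, a, b in self._pairs(other)})

    def __sub__(self, other):
        return SlotMap({k: a - b for k, a, b in self._pairs(other)})

    def fits_in(self, other):
        # True when every slot of self is <= the matching slot of other.
        return all(a <= b for _, a, b in self._pairs(other))


requested = SlotMap({"cpu": Decimal(2), "cuda.shares": Decimal("0.5")})
available = SlotMap({"cpu": Decimal(4), "mem": Decimal(8)})

print(requested.fits_in(available))  # False: cuda.shares 0.5 > implicit 0
print((available - SlotMap({"cpu": Decimal(1)}))["mem"])  # 8
```

With this rule, operands with different key sets no longer need to be rejected up front; a missing slot simply behaves as an empty allocation.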

Manager: Now we can keep the resource slot types (ai.backend.common.types.current_resource_slots) consistent while handling a single API request thanks to Python 3.7's contextvars, preventing potential race conditions that could arise when resolving this issue.
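To illustrate how a context variable can pin the slot types for the lifetime of one request, here is a small asyncio sketch. `current_slots`, `known_slots`, and `handle_request` are hypothetical names used only for this example; the manager's real request handling is not shown in this issue.

```python
import asyncio
import contextvars

# Hypothetical stand-in for ai.backend.common.types.current_resource_slots.
current_slots = contextvars.ContextVar("current_slots")

# Pretend this is the shared registry, which may change at any time.
known_slots = {"cpu": "count", "mem": "bytes"}


async def handle_request(name, results):
    # Snapshot the known slot types once at the start of the request.
    current_slots.set(dict(known_slots))
    await asyncio.sleep(0.01)  # other tasks may mutate the registry here
    results[name] = sorted(current_slots.get())


async def main():
    results = {}
    first = asyncio.create_task(handle_request("first", results))
    await asyncio.sleep(0)  # let the first handler take its snapshot
    known_slots["cuda.shares"] = "count"  # a new slot type appears
    second = asyncio.create_task(handle_request("second", results))
    await asyncio.gather(first, second)
    return results


results = asyncio.run(main())
print(results["first"])   # ['cpu', 'mem']
print(results["second"])  # ['cpu', 'cuda.shares', 'mem']
```

Each asyncio task runs in its own copy of the context, so the value set by one handler is invisible to the other, and a slot type that appears mid-request does not change what the in-flight handler sees.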

Agent: The known resource slot types are determined when the agent starts and loads the accelerator plugins. It now accepts any resource slots when creating a new kernel, but only continues to launch the kernel when all unknown slot values are set to zero. Otherwise, an UnsupportedResource exception is raised and passed back to the manager.
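The agent-side check described above might look roughly like this. It is a sketch under assumptions: `check_requested_slots` is a hypothetical helper, and the `UnsupportedResource` class defined here is a local placeholder rather than the agent's actual exception.

```python
from decimal import Decimal


class UnsupportedResource(Exception):
    """Local placeholder; the real exception class may differ."""


def check_requested_slots(requested, known_slot_types):
    # Find slots this agent has no plugin for that carry a nonzero request.
    unsupported = sorted(
        name
        for name, amount in requested.items()
        if name not in known_slot_types and Decimal(amount) != 0
    )
    if unsupported:
        raise UnsupportedResource(", ".join(unsupported))
    # Unknown slots set to zero are harmless; keep only the known ones.
    return {k: v for k, v in requested.items() if k in known_slot_types}


known = {"cpu", "mem"}
ok = check_requested_slots({"cpu": "2", "mem": "4", "cuda.shares": "0"}, known)
print(ok)  # {'cpu': '2', 'mem': '4'}

try:
    check_requested_slots({"cpu": "2", "cuda.shares": "0.5"}, known)
except UnsupportedResource as e:
    print("rejected:", e)  # rejected: cuda.shares
```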

When the agent is restarted and the accelerator plugins (or their configurations) have changed so that the agent's known slot types also change, the agent just keeps the existing containers running, as the manager will take care of tracking their resources.

NOTE: Executing new sessions with renamed slots (e.g., cuda.shares -> cuda.device) may conflict with existing containers, but we currently do not support such scenarios and cover them with user guidelines instead.


Let's update our scheduling and statistics code to work with this change.

achimnol commented 4 years ago

This will help us to deploy #82.