Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
[!NOTE]
Step 1 targets 24.03, the remaining steps target 24.12.
Step 1
Allow superadmins to forcibly change the status of kernels from PULLING to CANCELLED.
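A minimal sketch of the forced transition described in Step 1, assuming a simple role check. The `UserRole` enum and the `force_cancel_pulling` function are hypothetical names for illustration, not part of the Backend.AI codebase.

```python
import enum


class KernelStatus(enum.Enum):
    # Only the statuses needed for this illustration.
    PULLING = "PULLING"
    RUNNING = "RUNNING"
    CANCELLED = "CANCELLED"


class UserRole(enum.Enum):
    # Hypothetical role model; the real one may differ.
    USER = "user"
    SUPERADMIN = "superadmin"


def force_cancel_pulling(status: KernelStatus, role: UserRole) -> KernelStatus:
    """Return the kernel status after a forced cancellation request."""
    if role is not UserRole.SUPERADMIN:
        raise PermissionError("only superadmins may force-cancel a kernel")
    if status is not KernelStatus.PULLING:
        raise ValueError(f"cannot force-cancel a kernel in {status.name} status")
    return KernelStatus.CANCELLED
```

Restricting the source status to PULLING keeps the forced path from bypassing the normal termination flow of kernels that are already running.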
Step 2
Implement check-and-pull API
The API checks whether an agent has a specific image.
If the agent does not have the image, the API starts a background task that pulls the image and returns the background task id.
After the background task finishes, the agent should dispatch an ImagePulled event.
Check images early when creating kernels
When a kernel is scheduled, the manager first calls the check-and-pull API. This agent-side API spawns a background task that verifies whether the agent has the required image. If the agent does not have the image, the task pulls it. The API produces an ImagePullStartedEvent when the agent begins pulling the image and an ImagePullFinishedEvent when the task completes.
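The agent-side flow above could be sketched roughly as follows, assuming an in-memory event list in place of the real event bus and an `asyncio` task in place of the real background task manager. `ToyAgent` and its methods are hypothetical stand-ins; only the two event names come from this issue.

```python
import asyncio
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    # Hypothetical stand-in for an agent; not the real Backend.AI agent class.
    local_images: set[str] = field(default_factory=set)
    events: list[tuple[str, str]] = field(default_factory=list)
    tasks: dict[uuid.UUID, asyncio.Task] = field(default_factory=dict)

    async def _pull(self, image: str) -> None:
        await asyncio.sleep(0)  # placeholder for the actual registry pull
        self.local_images.add(image)

    async def _check_and_pull_task(self, image: str) -> None:
        if image not in self.local_images:
            # Produced only when a pull is actually needed.
            self.events.append(("ImagePullStartedEvent", image))
            await self._pull(image)
        # Produced whenever the task completes, pulled or not.
        self.events.append(("ImagePullFinishedEvent", image))

    async def check_and_pull(self, image: str) -> uuid.UUID:
        """Spawn the background task and return its id immediately."""
        task_id = uuid.uuid4()
        self.tasks[task_id] = asyncio.create_task(self._check_and_pull_task(image))
        return task_id
```

Because `check_and_pull` returns before the pull finishes, the caller learns the outcome only through the events, which is what lets the manager avoid blocking on slow image pulls.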
Consume Image events
The manager consumes ImagePullStartedEvent and ImagePullFinishedEvent events.
When an ImagePullStartedEvent is consumed, the manager transitions the kernel status to PULLING.
When an ImagePullFinishedEvent is consumed, the manager transitions the kernel status to PREPARED.
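As an illustration of the two transitions, the handler below maps each event to a target status. It assumes that events are matched to kernels by image name and that only kernels in PREPARING or PULLING are eligible; both are assumptions made for this sketch, and `Kernel` and `consume_image_event` are hypothetical names.

```python
import enum
from dataclasses import dataclass


class KernelStatus(enum.Enum):
    PREPARING = "PREPARING"
    PULLING = "PULLING"
    PREPARED = "PREPARED"


@dataclass
class Kernel:
    id: str
    image: str
    status: KernelStatus


# Target status for each consumed event.
EVENT_TO_STATUS = {
    "ImagePullStartedEvent": KernelStatus.PULLING,
    "ImagePullFinishedEvent": KernelStatus.PREPARED,
}

# Assumed guard: which statuses a kernel may leave for each target.
ALLOWED_FROM = {
    KernelStatus.PULLING: {KernelStatus.PREPARING},
    KernelStatus.PREPARED: {KernelStatus.PREPARING, KernelStatus.PULLING},
}


def consume_image_event(kernels: list[Kernel], event_name: str, image: str) -> list[str]:
    """Apply the transition for one event and return the ids of updated kernels."""
    target = EVENT_TO_STATUS[event_name]
    updated = []
    for kernel in kernels:
        if kernel.image == image and kernel.status in ALLOWED_FROM[target]:
            kernel.status = target
            updated.append(kernel.id)
    return updated
```

Allowing PREPARING to move straight to PREPARED covers the case where the image is already present and no ImagePullStartedEvent is ever produced.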
start-containers loop
The manager runs a global asynchronous background loop that monitors all kernels in PREPARED status and starts them.
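One possible shape for such a loop is sketched below, assuming kernel statuses live in a plain dictionary and that `create_kernel` is a caller-supplied coroutine. The function names, the polling interval, and the `"CREATING"` placeholder status are assumptions for illustration; this issue does not name the status that follows PREPARED.

```python
import asyncio
from typing import Awaitable, Callable


async def start_prepared_kernels(
    statuses: dict[str, str],
    create_kernel: Callable[[str], Awaitable[None]],
) -> list[str]:
    """Run one pass: start every kernel currently in PREPARED status."""
    started = []
    for kernel_id, status in list(statuses.items()):
        if status == "PREPARED":
            await create_kernel(kernel_id)
            statuses[kernel_id] = "CREATING"  # placeholder next status (assumption)
            started.append(kernel_id)
    return started


async def start_containers_loop(
    statuses: dict[str, str],
    create_kernel: Callable[[str], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Repeat the pass until `stop` is set, sleeping `interval` seconds between passes."""
    while not stop.is_set():
        await start_prepared_kernels(statuses, create_kernel)
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed; run the next pass
```

Polling is only one option. The same pass could instead be triggered when an ImagePullFinishedEvent is consumed, which would shorten the delay between PREPARED and kernel creation.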
Subtasks
[x] #3128
[ ] Update USER_RESOURCE_OCCUPYING_KERNEL_STATUSES
Exclude the PULLING kernel status from USER_RESOURCE_OCCUPYING_KERNEL_STATUSES, since kernels in PULLING status do not occupy any resources.
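The effect of that exclusion on resource accounting can be shown with a toy model. The status list and the other members of the non-occupying set below are illustrative assumptions and may not match the real definition; only the exclusion of PULLING comes from this issue.

```python
import enum


class KernelStatus(enum.Enum):
    # Reduced, illustrative list of statuses.
    PENDING = enum.auto()
    PREPARING = enum.auto()
    PULLING = enum.auto()
    PREPARED = enum.auto()
    RUNNING = enum.auto()
    TERMINATED = enum.auto()
    CANCELLED = enum.auto()


# Statuses assumed not to hold user resources; PULLING is the new entry.
NON_OCCUPYING = frozenset({
    KernelStatus.PENDING,
    KernelStatus.PULLING,
    KernelStatus.TERMINATED,
    KernelStatus.CANCELLED,
})

USER_RESOURCE_OCCUPYING_KERNEL_STATUSES = frozenset(
    status for status in KernelStatus if status not in NON_OCCUPYING
)


def occupied_slots(kernels: list[tuple[KernelStatus, int]]) -> int:
    """Sum the slots of kernels whose status counts as occupying resources."""
    return sum(
        slots
        for status, slots in kernels
        if status in USER_RESOURCE_OCCUPYING_KERNEL_STATUSES
    )
```

With this set, a kernel that is still pulling its image no longer counts against the user's quota in `occupied_slots`.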
As-is
sequenceDiagram
loop "prepare" loop
activate Manager
Manager->>Manager: Fetch "SCHEDULED" sessions
Manager->>+Agent: Create kernel
opt need to pull
Agent->>Event bus: Produce "Pulling" event
Agent->>Agent: Pull the image and wait till it finishes
Agent->>Event bus: Produce "Preparing" event
end
Agent->>Agent: Create kernel
Agent-->>-Manager: Return kernel creation info
deactivate Manager
end
To-do
sequenceDiagram
loop "check-precond" loop
activate Manager
Manager->>Manager: Fetch "SCHEDULED" sessions and transit status to "PREPARING"
Manager->>Agent: check-and-pull
activate Agent
Note right of Agent: Run in background task
Agent-->>Manager: Return background task id
deactivate Manager
end
opt need to pull
Agent->>Event bus: Produce "Pulling" event
Agent->>Agent: Pull the image and wait till it finishes
end
Agent->>Event bus: Produce "Pull finished" event
deactivate Agent
loop "create-kernel" loop
activate Manager
Manager->>Manager: Fetch "PREPARED" sessions
Manager->>+Agent: Create kernel
Agent->>Agent: Create kernel
Agent-->>-Manager: Return kernel creation info
deactivate Manager
end
Step 3
Step 4