lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Improve image pulling process #2276

Open fregataa opened 5 months ago

fregataa commented 5 months ago

Steps

[!NOTE] Step 1 targets 24.03; the remaining steps target 24.12.

Step 1

Step 2

Subtasks

As-is

sequenceDiagram
    loop "prepare" loop
        activate Manager
        Manager->>Manager: Fetch "SCHEDULED" sessions
        Manager->>+Agent: Create kernel
        opt need to pull
            Agent->>Event bus: Produce "Pulling" event
            Agent->>Agent: Pull the image and wait till it finishes
            Agent->>Event bus: Produce "Preparing" event
        end
        Agent->>Agent: Create kernel
        Agent-->>-Manager: Return kernel creation info
        deactivate Manager
    end

To-do

sequenceDiagram
    loop "check-precond" loop
        activate Manager
        Manager->>Manager: Fetch "SCHEDULED" sessions and transition their status to "PREPARING"
        Manager->>Agent: check-and-pull
        activate Agent
        Note right of Agent: Run in background task
        Agent-->>Manager: Return background task id
        deactivate Manager
    end
    opt need to pull
        Agent->>Event bus: Produce "Pulling" event
        Agent->>Agent: Pull the image and wait till it finishes
    end
    Agent->>Event bus: Produce "Pull finished" event
    deactivate Agent

    loop "create-kernel" loop
        activate Manager
        Manager->>Manager: Fetch "PREPARED" sessions
        Manager->>+Agent: Create kernel
        Agent->>Agent: Create kernel
        Agent-->>-Manager: Return kernel creation info
        deactivate Manager
    end
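
The to-do flow above can be sketched in a few lines of asyncio. This is only an illustration under assumptions, not the actual Backend.AI implementation: the `Session` and `Agent` classes, the function names, the event tuples, and the in-process `asyncio.Queue` standing in for the event bus are all hypothetical. In particular, the diagram does not say which component moves a session to "PREPARED"; the sketch assumes a manager-side handler does so when it receives the "Pull finished" event.

```python
import asyncio
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    id: str
    image: str
    status: str = "SCHEDULED"


@dataclass
class Agent:
    """Hypothetical agent; names and signatures are illustrative only."""
    events: asyncio.Queue  # stand-in for the event bus
    local_images: set = field(default_factory=set)
    tasks: dict = field(default_factory=dict)

    async def check_and_pull(self, session_id: str, image: str) -> str:
        # Start the pull as a background task and return its id immediately,
        # so the manager's loop is not blocked while the image downloads.
        task_id = uuid.uuid4().hex
        self.tasks[task_id] = asyncio.create_task(self._pull(session_id, image))
        return task_id

    async def _pull(self, session_id: str, image: str) -> None:
        if image not in self.local_images:  # "opt need to pull"
            await self.events.put(("pulling", session_id))
            await asyncio.sleep(0.01)  # placeholder for the real image pull
            self.local_images.add(image)
        await self.events.put(("pull_finished", session_id))

    async def create_kernel(self, session_id: str) -> dict:
        return {"session_id": session_id, "kernel_id": uuid.uuid4().hex}


async def check_precond(sessions, agent):
    """One pass of the "check-precond" loop."""
    for s in sessions:
        if s.status == "SCHEDULED":
            s.status = "PREPARING"
            await agent.check_and_pull(s.id, s.image)


async def handle_events(sessions, events, expected):
    """Assumed manager-side handler: mark sessions "PREPARED" on pull_finished."""
    by_id = {s.id: s for s in sessions}
    log, finished = [], 0
    while finished < expected:
        name, session_id = await events.get()
        log.append((name, session_id))
        if name == "pull_finished":
            by_id[session_id].status = "PREPARED"
            finished += 1
    return log


async def create_kernels(sessions, agent):
    """One pass of the "create-kernel" loop; later status changes are out of scope."""
    return [await agent.create_kernel(s.id) for s in sessions if s.status == "PREPARED"]


async def run_demo():
    events = asyncio.Queue()
    agent = Agent(events=events, local_images={"python:3.12"})
    sessions = [Session("s1", "python:3.12"), Session("s2", "pytorch:2.3")]
    await check_precond(sessions, agent)
    after_dispatch = [s.status for s in sessions]
    log = await handle_events(sessions, events, expected=len(sessions))
    kernels = await create_kernels(sessions, agent)
    return after_dispatch, log, kernels, [s.status for s in sessions]


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The point of the split is visible in `check_precond`: it returns as soon as the background tasks are registered, whereas in the as-is flow the manager's "prepare" loop waits for the pull inside the kernel creation call.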

Step 3

Step 4

achimnol commented 5 months ago

Step 1 targets 24.03 while the remaining steps target 24.09. (cc: @adrysn @xyloon @kmkwon94)