lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Make the shared memory setting for new sessions more intuitive #1726

Open achimnol opened 11 months ago

achimnol commented 11 months ago

This is a late follow-up to lablup/backend.ai-webui#314.

Currently, the image label ai.backend.resource.min.mem is interpreted as the main memory size, excluding the shared memory size. However, the web UI's resource configuration automatically sets the shared memory size ($S$) to 64 MiB for main memory sizes ($M$) less than 4 GiB, while our scheduler allocates the sum of the main memory size and the shared memory size ($M + S$).

This causes confusion when allocating the minimum amount of memory, e.g., $M + S = 256\ \mathrm{MiB}$: the request fails because the web UI sends $M = 192\ \mathrm{MiB}$ and $S = 64\ \mathrm{MiB}$, while the manager's enqueue-session API handler compares the image label ai.backend.resource.min.mem against $M$ only and requires $M \ge 256\ \mathrm{MiB}$.
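To make the failure concrete, here is a minimal sketch of the mismatch (the variable names are illustrative, not the actual manager code):

```python
MIB = 2 ** 20

image_min_mem = 256 * MIB      # ai.backend.resource.min.mem

# What the web UI sends when the user picks the smallest 256 MiB preset:
requested_mem = 192 * MIB      # M (main memory, shmem excluded)
requested_shmem = 64 * MIB     # S (auto-configured by the web UI)

# Current enqueue-session check: compares the image minimum against M only.
passes_current_check = requested_mem >= image_min_mem  # False -> request rejected

# The scheduler, however, allocates M + S, which does satisfy the minimum.
satisfies_actual_allocation = requested_mem + requested_shmem >= image_min_mem  # True
```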

We are going to update the web UI to hide the detailed shared memory configuration for most use cases, and the memory resource slider will expose $M + S$, with $S$ auto-configured depending on the value of $M + S$.
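One possible shape of such an auto-configuration rule, as a sketch only: the 64 MiB figure mirrors the current web UI default for small sessions, while the rule for larger totals is an assumption rather than a decided policy.

```python
MIB = 2 ** 20
GIB = 2 ** 30

def auto_shmem(total_mem: int) -> int:
    """Pick a shared memory size S for a user-selected total (M + S).

    The 64 MiB figure mirrors the current web UI default for small sessions;
    the branch for larger totals is a placeholder assumption.
    """
    if total_mem < 4 * GIB:
        return 64 * MIB
    # Assumption: scale shmem with the total for larger sessions (policy TBD).
    return total_mem // 16

def split_total(total_mem: int) -> tuple[int, int]:
    """Split the user-facing total into (M, S) for the enqueue-session request."""
    shmem = auto_shmem(total_mem)
    return total_mem - shmem, shmem

# The 256 MiB example above: the slider shows 256 MiB, the request carries
# M = 192 MiB and S = 64 MiB.
print(split_total(256 * MIB))  # (201326592, 67108864)
```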

To better support the above web UI update, let's change the enqueue-session API handler to:

The Client SDK and CLI should still expose the raw configurations as options. So, let's:

fregataa commented 10 months ago

The current implementation takes either the user-specified shmem or the minimum shmem value required by the image, adds it to the minimum mem value required by the image, and compares the result to the user-specified mem value. In other words, it compares ($S$ or ai.backend.resource.min.shmem) + ai.backend.resource.min.mem to $M$. Are we going to change this behavior to check ai.backend.resource.min.mem $\le M + S$? And how should we handle the case of the max resource slot? Should it be ai.backend.resource.max.mem $\ge M + S$?
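Expressed as a rough sketch, under the assumption that both the minimum and the maximum bounds move to the $M + S$ basis (which is exactly the open question, not a confirmed design):

```python
def check_memory_bounds(
    mem: int,             # M, user-specified main memory
    shmem: int,           # S, user-specified or auto-configured shared memory
    min_mem: int,         # ai.backend.resource.min.mem
    max_mem: int | None,  # ai.backend.resource.max.mem, if the image defines one
) -> bool:
    total = mem + shmem
    if total < min_mem:       # proposed: check the minimum bound against M + S
        return False
    if max_mem is not None and total > max_mem:  # open question: same basis for the maximum?
        return False
    return True
```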