Closed: @sighingnow closed this issue 2 years ago.
Hi, I'm interested in this issue. Could you please provide more detailed information, or are there any references to help me get started?
Hi @is-shidian, vineyard currently can only share data in host memory, while many training/inference/sampling engines leverage GPUs to accelerate tensor-centric tasks, so data sharing is also likely to occur between different tasks on the GPU. Therefore, the purpose of this topic is to implement a data-sharing mechanism on the GPU, which requires the implementation of:
reference:
Describe your problem
Vineyard provides in-memory object sharing between diverse compute engines, allowing them to obtain shared objects in a zero-copy fashion. By providing a convenient API (i.e., `get`/`put`) and an extensible mechanism (i.e., `builder`/`resolver`), vineyard enables efficient data sharing among polyglot compute engines and avoids the cost of replication and data serialization/deserialization.

Currently, such efficient data sharing is only allowed to happen in host memory, while many computational engines are now introducing GPUs to accelerate computation (e.g., DL and CV). Thus we have the same opportunity to perform efficient data exchange in GPU memory without downloading data to host memory. The goal of this task is to extend vineyard's current mechanism for data sharing in host memory to GPU memory, so that:
- Different GPU processes can directly share blobs in a zero-copy fashion.
- vineyardd will have a long-running GPU daemon responsible for managing the blobs.
- Use the existing vineyard `meta` design to organize blobs into complex objects.

Note that vineyardd on the GPU will be responsible for allocating data and then sharing memory using the CUDA IPC `MemHandle`, as vineyardd does on main memory. To do this, we need a new `GPUBlobStore`, which will be responsible for managing the blob data on the GPU; each blob will also have an additional representation bit field to distinguish whether it is a blob on the GPU or a blob in host memory.

Additional context
This issue is part of our Alibaba Summer of Code 2022 Program.
- Difficulty: Normal
- Mentor: Ke Meng (@septicmk)
By `GPUBlobStore`, did you mean `GPUBulkStore`?
@Y-jiji Yes
@septicmk No offense, but is there any scenario where things in GPU memory can be kept immutable? (Please give me a punch if I misunderstood something.) My previous understanding is that GPU memory is rather expensive and is almost always changing. In typical training settings, the tensors kept on the GPU are usually model parameters, and the data (which are the only immutable things) are usually kept in host memory. Techniques like preloading are commonly used to eliminate the I/O bottleneck, but I haven't heard about loading data directly from another remote GPU, since such setups are somewhat elusive...
@Y-jiji For example, a graph task (learning, like GNNs; analytics, like label propagation) loads the topology of a real-world graph into the GPU but only traverses the graph without changing it (reloading these data into the GPU via PCIe may be time-consuming). We have also considered allowing mutable objects in future updates.
@septicmk Thank you for your friendly punch.
Implemented in https://github.com/v6d-io/v6d/pull/876.
Thanks @CSWater for the hard work!