quiver-team / quiver-feature

High-performance RDMA-based distributed feature collection component for training GNN models on EXTREMELY large graphs
Apache License 2.0

A more memory-efficient server and process startup #11

Closed eedalong closed 2 years ago

eedalong commented 2 years ago

We have the following memory types involved:

They need to be handled at data-load time so that all three map to the same memory, avoiding an excessively high memory peak. @joker-xii

eedalong commented 2 years ago

During initialization, we need to load the data into shared memory and pin it for the GPU. This is done in 3 steps:

  1. Load the PyTorch tensor directly into shared memory.
  2. Create a PyTorch tensor from that shared memory.
  3. Create the ShardTensor & DistTensorPGAS.
eedalong commented 2 years ago

A sketch for loading a 300G `feature.pt`:

```python
# Current path: a plain load materializes the whole tensor in process memory.
tensor = torch.load("feature.pt")

# Goal: load directly into shared memory instead.
sh_mem_tensor = quiver_feature.load("feature.pt")

# Proposed file layout: [meta_data, tensor_data], so the metadata can be read
# without loading the tensor body (load_meta is a sketched API here, not an
# existing torch function).
meta = torch.load_meta("feature.pt")

# After loading, the tensor is shared, registered, and wrapped:
tensor.share_mem()
register_to_device()
shard_tensor = ShardTensor(tensor)
ib_register(tensor)
```