v6d-io / v6d

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)
https://v6d.io
Apache License 2.0

Modifications and memory occupation in vineyard torch module #1859

Closed (dashanji closed this 2 months ago)

dashanji commented 3 months ago

Describe your problem

Thanks to @TrafalgarZZZ for reporting. There are two issues when testing the vineyard torch module.

The original torch module passed to put is modified in place by the put call.

Reproduce

import safetensors
import safetensors.torch

import vineyard
import vineyard.contrib.ml.torch as vineyard_torch

# Load the checkpoint into an ordinary torch state_dict.
with open("/mnt/stable-diffusion-models/Stable-diffusion/v1-5-pruned-emaonly-1.safetensors", 'rb') as f:
    state_dict = safetensors.torch.load(f.read())

print(state_dict)

# Put the state_dict into vineyard under the torch context.
client = vineyard.connect("/tmp/vineyard_test.sock")
with vineyard_torch.torch_context(client):
    client.put(state_dict)

# Print the same dict again after the put.
print(state_dict)

Original state_dict

...
'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc1.weight': tensor([[ 0.0402,  0.0049,  0.0031,  ...,  0.0076, -0.0040, -0.0004],
         [ 0.0320, -0.0247,  0.0270,  ...,  0.0014, -0.0266, -0.0196],
         [-0.0072,  0.0229,  0.0050,  ..., -0.0068, -0.0446, -0.0313],
         ...,
         [ 0.0280, -0.0149,  0.0136,  ...,  0.0182, -0.0120, -0.0161],
         [ 0.0343, -0.0128, -0.0234,  ...,  0.0229, -0.0218,  0.0272],
         [ 0.0184,  0.0124,  0.0135,  ..., -0.0094,  0.0302, -0.0117]]),
 ...}

state_dict after put

...
 'cond_stage_model.transformer.text_model.encoder.layers.10.layer_norm2.weight': None, 'model.diffusion_model.middle_block.1.proj_out.bias': None, 'model.diffusion_model.output_blocks.9.0.in_layers.2.weight': None, 'first_stage_model.encoder.mid.block_1.conv2.weight': None, 'model.diffusion_model.output_blocks.4.1.transformer_blocks.0.norm3.bias': None, 'model.diffusion_model.output_blocks.6.1.transformer_blocks.0.ff.net.2.weight': None, 'model.diffusion_model.input_blocks.7.0.out_layers.3.weight': None, 'first_stage_model.decoder.up.2.block.1.norm2.weight': None, 'first_stage_model.encoder.down.1.block.0.conv1.weight': None, 'cond_stage_model.transformer.text_model.encoder.layers.5.mlp.fc2.bias': None}
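As a temporary client-side workaround (only a sketch based on the observed behavior above, where the dict's values end up replaced with None; not the intended fix), the caller can pass a shallow copy of the state_dict to put so the original dict keeps its tensor references:

# Hypothetical workaround: put a shallow copy so the caller's state_dict keeps
# its tensor references even if the torch context rewrites the dict in place.
state_dict_to_put = dict(state_dict)

with vineyard_torch.torch_context(client):
    client.put(state_dict_to_put)

# The original dict is untouched; only the copy may have been rewritten.
assert all(v is not None for v in state_dict.values())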

The torch module is only partially put into vineyard when vineyard does not have enough memory, which leaves unnecessary memory occupation behind.

Start vineyardd with 1Gi of memory, which cannot hold all the tensors (around 4.5Gi in total), then run the following code.

Reproduce

import safetensors
import safetensors.torch

import vineyard
import vineyard.contrib.ml.torch as vineyard_torch

with open("/mnt/stable-diffusion-models/Stable-diffusion/v1-5-pruned-emaonly-1.safetensors", 'rb') as f:
    state_dict = safetensors.torch.load(f.read())

client = vineyard.connect("/tmp/vineyard_test.sock")
try:
    with vineyard_torch.torch_context(client):
        client.put(state_dict)
except Exception:
    # The put fails once vineyardd runs out of memory, but the tensors that
    # were already put stay in vineyardd and keep it nearly full.
    print(client.status)

Output
InstanceStatus:
    instance_id: 11
    deployment: local
    memory_usage: 1072015920
    memory_limit: 1073741824
    deferred_requests: 0
    ipc_connections: 1
    rpc_connections: 0

Ideally, we should not put an incomplete set of tensors into vineyard at all.
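Until the partial put can be rolled back automatically, a rough client-side guard along the following lines could avoid filling vineyardd with an incomplete state_dict. This is only a sketch: the memory_limit and memory_usage attribute names on the status object are assumed from the fields printed above.

# Estimate the payload size of all tensors in the state_dict.
required = sum(t.numel() * t.element_size() for t in state_dict.values())

# Assumed attribute names, matching the printed InstanceStatus fields.
status = client.status
available = status.memory_limit - status.memory_usage

# Refuse to start the put when the tensors clearly cannot fit, instead of
# leaving a partially-written state_dict behind in vineyardd.
if required > available:
    raise MemoryError(
        f"state_dict needs ~{required / 2**30:.2f} GiB, "
        f"only {available / 2**30:.2f} GiB free in vineyardd"
    )

with vineyard_torch.torch_context(client):
    client.put(state_dict)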