```python
import time

import safetensors
import safetensors.torch
import torch
import vineyard
from vineyard.contrib.ml.torch import torch_context

state_file = "stable-diffusion-v1-5-pruned-emaonly.safetensors"
with open(state_file, 'rb') as f:
    state_dict = safetensors.torch.load(f.read())

# Put the state dict into the first vineyard instance.
client = vineyard.connect("/tmp/vineyard-local.sock")
with torch_context(client):
    obj = client.put(state_dict, persist=True, name="state_dict")

# Fetch the object from the second vineyard instance, which
# triggers a migration of all the blobs.
client2 = vineyard.connect("/tmp/vineyard-local1.sock")
start = time.time()
with torch_context(client2):
    data1 = client2.get(name="state_dict", fetch=True)
end = time.time()
print(f"Migrate Time taken: {end - start} seconds.")

# Get the same object again, via the RPC client of the first instance.
start = time.time()
rpc_client = vineyard.connect(host="127.0.0.1", port=9611)
with torch_context(rpc_client):
    data2 = rpc_client.get(name="state_dict")
end = time.time()
print(f"RPC Time taken: {end - start} seconds.")

assert len(state_dict) == len(data1), "State dict and data1 have different lengths"
assert len(state_dict) == len(data2), "State dict and data2 have different lengths"
for k, v in state_dict.items():
    assert torch.equal(v, data1[k]), f"Tensors for data1 key {k} do not match."
    assert torch.equal(v, data2[k]), f"Tensors for data2 key {k} do not match."
print("All operations are completed.")
```
The result is as follows:
Migrate Time taken: 12.88494062423706 seconds.
RPC Time taken: 1.898338794708252 seconds.
All operations are completed.
Based on profiling, we found several issues in the current recursive implementation of `ReceiveRemoteBuffers`:
1. When migrating an object with many blobs, the recursive callback chain becomes very deep, which causes frequent context switches.
2. The number of captured lambda objects also grows with the recursion depth, which causes frequent memory allocations and frees.
Ideas:
1. Bind the original name to the newly migrated object again, so that repeated gets by name do not re-trigger the migration.
2. Record a migration map with object IDs as keys in the vineyard server; this guarantees that for the same object ID there is at most one in-flight migration operation.
3. Replace the recursive implementation with a queue, and record all the callback lambdas in a shared struct instead of capturing them recursively.
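The queue-based idea above can be sketched as follows. This is a minimal, self-contained simulation, not vineyard's actual C++ code: `receive_buffers_iterative`, `recv_buffer`, and `pending` are hypothetical names. Instead of each completion callback recursively issuing the next receive, a loop drains a shared work queue, so the call depth stays constant and no nested lambda captures accumulate, regardless of how many blobs the object has.

```python
from collections import deque

def receive_buffers_iterative(buffer_ids, recv_buffer):
    """Receive all buffers with a queue instead of recursive callbacks.

    `recv_buffer` is a hypothetical callable that receives one buffer
    and returns its payload; in vineyard this would correspond to the
    per-blob network receive step inside ReceiveRemoteBuffers.
    """
    pending = deque(buffer_ids)   # shared work queue, no nested captures
    received = {}
    while pending:                # constant call depth, however many blobs
        buffer_id = pending.popleft()
        received[buffer_id] = recv_buffer(buffer_id)
    return received

# Usage: even with 100000 "blobs", the call depth stays at 1, whereas a
# recursive callback chain of this length would be 100000 frames deep.
payloads = receive_buffers_iterative(range(100000), lambda i: i * 2)
```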
Describe your problem

The current implementation of the migrate object API has two issues:

1. The name of the old object is not bound to the newly migrated object. When calling `client.get(name="data", fetch=True)`, the migration is always triggered again, even if the object has already been migrated.
2. The recursive implementation of `ReceiveRemoteBuffers` is inefficient when the object contains many blobs, as profiled above.

To reproduce, start with two vineyard instances locally:

Instance 1:
./bin/vineyardd --etcd_endpoint=http://127.0.0.1:33774 --socket=/tmp/vineyard-local.sock --rpc_socket_port=9611 --compression=false --reserve_memory=true --size=20Gi

Instance 2:
./bin/vineyardd --etcd_endpoint=http://127.0.0.1:33774 --socket=/tmp/vineyard-local1.sock --rpc_socket_port=9612 --compression=false --reserve_memory=true --size=20Gi

Then test it with the code shown above.
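The first fix (rebinding the name after migration) can be illustrated with a toy in-memory model rather than the real vineyard client; `ToyInstance`, `ToyCluster`, and `migrations` are illustrative names, not vineyard APIs. The point is that once the old name is bound to the migrated copy, the next get-by-name is a local hit and no second migration runs.

```python
class ToyInstance:
    """A toy stand-in for one vineyard instance: a name -> object store."""
    def __init__(self):
        self.objects = {}

class ToyCluster:
    """Simulates `get(name=..., fetch=True)` with name re-binding."""
    def __init__(self):
        self.migrations = 0  # counts how many times a migration ran

    def get_with_fetch(self, local, remote, name):
        if name in local.objects:      # already migrated: serve locally
            return local.objects[name]
        self.migrations += 1           # otherwise migrate from the remote
        migrated = remote.objects[name]
        # The proposed fix: bind the old name to the migrated object,
        # so the next get-by-name is a local hit instead of a migration.
        local.objects[name] = migrated
        return migrated

src, dst = ToyInstance(), ToyInstance()
src.objects["state_dict"] = {"weight": [1.0, 2.0]}
cluster = ToyCluster()
first = cluster.get_with_fetch(dst, src, "state_dict")
second = cluster.get_with_fetch(dst, src, "state_dict")
```

Without the re-binding line, `migrations` would reach 2 here, which is exactly the behavior the reproduction script observes.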