tshimpi opened 1 year ago
This week's update:
1) I have been able to build PyTorch locally and make changes.
2) Went through the code and mapped out where I have to make changes.
3) @rajveerb had suggested that we make a queue abstraction using gRPC and retain the structure of the code. I feel this is a good idea, as the queue can also be tested easily and independently. I am proceeding with this idea.
4) I have created a DataLoader class for gRPC communication. Will be making my changes in `dataloader.py`.
Need to look into PyTorch's RPC semantics and what it does exactly.
Update: Looked into PyTorch RPC. The way it works is that both servers and clients create a process group. Once all the workers involved are up and have joined the process group, RPC communication is allowed. The caller invokes a local stub function, but the function actually runs on the remote worker.
I have tested it out locally to check that it can handle our tensor data well. I wrote a client and a server that communicate via RPC, with the server returning a (128, 3, 224) tensor. There were no issues.
I also used my local build of PyTorch, so that works fine as well.
We can now integrate it into PyTorch.
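A minimal sketch of that local test, assuming PyTorch's `torch.distributed.rpc` API; the process names, port, and `produce_batch` helper are illustrative stand-ins, not the actual test code:

```python
# Run this script twice, in two separate terminals:
#   python rpc_test.py 0 2   (client)
#   python rpc_test.py 1 2   (worker)
import os
import sys
import torch
import torch.distributed.rpc as rpc

def produce_batch():
    # Stand-in for worker-side preprocessing; mirrors the tensor shape above.
    return torch.randn(128, 3, 224)

def main(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    if rank == 0:
        # Client: join the process group, then call the stub; the function
        # actually executes on the remote worker.
        rpc.init_rpc("client", rank=rank, world_size=world_size)
        batch = rpc.rpc_sync("worker", produce_batch)
        print(batch.shape)  # torch.Size([128, 3, 224])
    else:
        # Worker: join the process group and serve incoming RPCs.
        rpc.init_rpc("worker", rank=rank, world_size=world_size)
    rpc.shutdown()  # blocks until all outstanding RPCs complete

if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))
```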
Can you use PyTorch RPC to replace the preprocessing communication in the existing pipeline?
Yes. Because we are using tensors, we don't even need to marshal and unmarshal the data. PyTorch also provides both sync and async RPC options. We will use sync, as even the original code waits for a worker.
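For reference, a hedged sketch of the two call styles; this assumes `rpc.init_rpc` has already been called and reuses the placeholder `produce_batch` from the test sketch above:

```python
import torch.distributed.rpc as rpc

# Synchronous: blocks until the remote call returns, matching the original
# DataLoader behavior of waiting on a worker's result.
batch = rpc.rpc_sync("worker", produce_batch)

# Asynchronous alternative: returns a Future immediately; the result is
# retrieved later via wait() or a completion callback.
fut = rpc.rpc_async("worker", produce_batch)
batch = fut.wait()
```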
Have you tried running the code with RPC for the existing pipeline?
The above figure shows the proposed architecture for the RPC communication. This implementation preserves the exact semantics of the original code without having to share Python queues.
Initial Connection:
The initial connection between the dataloader and the workers takes place via `init_process_group`. For it, we need to set a common address and port to connect to on all the participating machines. The call blocks until all the participants are connected.
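As a sketch of that rendezvous (the backend, address, port, rank, and world size below are illustrative assumptions):

```python
import os
import torch.distributed as dist

# Every participant (dataloader + workers) points at the same address/port.
os.environ["MASTER_ADDR"] = "10.0.0.1"   # illustrative address
os.environ["MASTER_PORT"] = "23456"      # illustrative port

rank = 0          # unique per process: e.g. 0 for the dataloader, 1..N for workers
world_size = 3    # total number of participants

# Blocking call: returns only once all world_size processes have connected.
dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```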
Experiments with PyTorch's RPC framework:
`init_process_group` has to be restarted for all participants if we want to reconnect a failed worker. I have not researched a fix for this and do not know whether it is a blocker for this approach. @kexinrong @rajveerb @Kadle11 I will also research how exactly `init_process_group` works, as @rajveerb suggested in this week's meeting.
Notes on the RPC Architecture:
Indices are sent to the workers via `put_index`. After the RPC gets the result for its future, it calls the callback, which adds the result to the `local_data_queue` in the dataloader.
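A hedged sketch of that flow, assuming `put_index` issues an async RPC whose completion callback feeds the dataloader's local queue; `fetch_batch` and the worker name are hypothetical placeholders, and `rpc.init_rpc` is assumed to have been called already:

```python
import queue
import torch.distributed.rpc as rpc

local_data_queue = queue.Queue()

def fetch_batch(index):
    # Placeholder for the worker-side fetch/preprocess of one index.
    ...

def put_index(worker_name, index):
    # Send the index to the worker asynchronously.
    fut = rpc.rpc_async(worker_name, fetch_batch, args=(index,))
    # When the future resolves, the callback pushes the result onto the
    # dataloader's local queue.
    fut.add_done_callback(lambda f: local_data_queue.put(f.value()))
```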
Firstly, let's try to mimic the existing code's failure model, i.e. how it handles failure.
Secondly, the approach of restarting `init_process_group` is really bad. I am against patching the existing logic of `init_process_group` to bypass this issue, in case it leads to breaking existing behavior of `init_process_group`.
In the meantime, I would suggest you go ahead and figure out how `init_process_group` works.
Okay, I'll spend some time on both
I have integrated the RPC code into the PyTorch dataloader and workers following the above architecture. I am testing it tomorrow and will give a testing update in the meeting on Wednesday.
@rajveerb and I sat down to debug the RPC issue. We came to the conclusion that the issue is with forking or spawning a worker from the main process: when that happens, the main process ends up calling the stub RPC locally instead of on the remote worker. When the worker and the main process are started separately, they work fine.
We can try to integrate the container creation and RPC code and see if it works in that case.
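In the meantime, a sketch of the setup that did work, with the dataloader and worker launched as separate processes (e.g., in separate containers); the script name, flags, and port are assumptions:

```python
# Launched separately, e.g.:
#   python rpc_node.py --role dataloader --rank 0 --world-size 2
#   python rpc_node.py --role worker     --rank 1 --world-size 2
import argparse
import os
import torch.distributed.rpc as rpc

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--role", choices=["dataloader", "worker"], required=True)
    parser.add_argument("--rank", type=int, required=True)
    parser.add_argument("--world-size", type=int, default=2)
    args = parser.parse_args()

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Each process joins the RPC group under its own name; since nothing is
    # forked/spawned from the main process, stub calls resolve to the remote
    # worker rather than being run locally.
    name = "dataloader" if args.role == "dataloader" else f"worker{args.rank}"
    rpc.init_rpc(name, rank=args.rank, world_size=args.world_size)
    rpc.shutdown()

if __name__ == "__main__":
    main()
```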
@Kadle11 Please check the previous comment.
Issue for handling the communication between the worker and the main process. Approaches covered: