rajveerb / lotus

Lotus: Characterization of Machine Learning Preprocessing Pipelines via Framework and Hardware Profiling

Communication between process/container workers and the main process #17

Open tshimpi opened 1 year ago

tshimpi commented 1 year ago

Issue for handling the communication between the worker and main processes. Approaches covered:

  1. gRPC
  2. PyTorch RPC
tshimpi commented 1 year ago

Week's update:

  1. I have been able to build PyTorch locally and make changes.
  2. Went through the code and mapped out where I have to make changes.
  3. @rajveerb had suggested that we build a queue abstraction using gRPC and retain the structure of the code. I feel this is a good idea, as the queue can then be tested easily and independently (see the sketch below). I am proceeding with this idea.
  4. I have created a dataLoader class for gRPC communication. I will be making my changes in dataloader.py.
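For reference, a minimal sketch of what such a queue-style abstraction over gRPC might look like. The service, the `Put`/`Get` RPCs, and the generated `index_queue_pb2*` modules are hypothetical placeholders, not the actual proto used in this repo.

```python
# Hypothetical queue-like wrapper around a gRPC stub so the existing
# dataloader code can keep its queue.put()/queue.get() structure.
# Assumes a proto service roughly like:
#   service IndexQueue { rpc Put(Item) returns (Empty); rpc Get(Empty) returns (Item); }
# compiled into index_queue_pb2 / index_queue_pb2_grpc (names are placeholders).
import grpc
import index_queue_pb2
import index_queue_pb2_grpc


class GrpcQueue:
    """Drop-in queue facade backed by a remote gRPC worker."""

    def __init__(self, worker_address: str):
        self._channel = grpc.insecure_channel(worker_address)
        self._stub = index_queue_pb2_grpc.IndexQueueStub(self._channel)

    def put(self, index: int) -> None:
        # Send one work item (an index) to the remote worker.
        self._stub.Put(index_queue_pb2.Item(index=index))

    def get(self):
        # Blocking call; returns the next processed item from the worker.
        return self._stub.Get(index_queue_pb2.Empty())

    def close(self) -> None:
        self._channel.close()
```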

rajveerb commented 1 year ago

Need to look into PyTorch's RPC semantics and what it does exactly.

tshimpi commented 1 year ago

Update: I looked into PyTorch RPC. The way it works is that both servers and clients join a process group. Once all the workers involved are up and have joined the process group, RPC communication is allowed. The caller invokes what looks like a local stub function, but the function actually runs on the remote worker.

I tested it out locally to check whether it can handle our tensor data well. I wrote a client and server that communicate via RPC and return a 128×3×224 tensor back. There were no issues.

I also used my local build of PyTorch, so that works fine as well.

We can now integrate it into PyTorch.
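For context, a minimal sketch of the kind of two-process RPC test described above, using `torch.distributed.rpc`. The worker names, port, and tensor shape follow the description; treat it as an illustration rather than the exact test script.

```python
# Minimal two-process PyTorch RPC test: the "dataloader" asks the "worker"
# to produce a tensor and receives it back. Run with rank 0 and rank 1 in
# two separate processes (names and shape chosen to mirror the test above).
import os
import sys
import torch
import torch.distributed.rpc as rpc


def produce_batch():
    # Runs on the worker; the caller only sees the returned tensor.
    return torch.zeros(128, 3, 224)


def main(rank: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    name = "dataloader" if rank == 0 else "worker"
    rpc.init_rpc(name, rank=rank, world_size=2)  # blocks until both join

    if rank == 0:
        batch = rpc.rpc_sync("worker", produce_batch)
        print(batch.shape)  # torch.Size([128, 3, 224])

    rpc.shutdown()  # waits for outstanding work, then tears down


if __name__ == "__main__":
    main(int(sys.argv[1]))
```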

rajveerb commented 1 year ago

Can you use PyTorch RPC to replace the preprocessing communication in the existing pipeline?

tshimpi commented 1 year ago

Yes. Because we are using tensors, we don't even need to marshal and unmarshal the data. PyTorch also provides both synchronous and asynchronous RPC. We will use the synchronous variant, since the original code also waits for a worker.
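For reference, the two variants look roughly like this (`rpc_sync` blocks for the result, `rpc_async` returns a Future). This assumes an already-initialized RPC group with a peer named "worker"; `process_index` is a placeholder for the worker-side preprocessing function.

```python
# Assumes an RPC group has already been initialized (see the test sketch
# above) with a peer named "worker"; process_index is a placeholder.
import torch
import torch.distributed.rpc as rpc


def process_index(idx: int) -> torch.Tensor:
    return torch.zeros(3, 224, 224)  # stand-in for real preprocessing


idx = 0

# Synchronous: blocks until the worker returns the result, matching the
# original code's behavior of waiting on a worker.
batch = rpc.rpc_sync("worker", process_index, args=(idx,))

# Asynchronous alternative: returns a Future immediately.
fut = rpc.rpc_async("worker", process_index, args=(idx,))
batch = fut.wait()
```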

rajveerb commented 1 year ago

Have you tried running the code with RPC for the existing pipeline?

tshimpi commented 1 year ago

[Figure: RPC Architecture]

tshimpi commented 1 year ago

The above figure shows the proposed architecture for RPC-based communication. This implementation preserves the exact semantics of the original code without having to share Python queues.

Initial connection: The initial connection between the dataloader and the workers is established via init_process_group. This requires setting a common master address and port on all participating machines. The call blocks until all participants have connected.
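As a rough sketch of that rendezvous (illustrative only; the actual integration may differ), every participant points at the same master address and port and blocks in the init call until all have joined:

```python
# Every participant (dataloader and workers) runs this with its own rank.
# MASTER_ADDR/MASTER_PORT must be identical on all machines; the init call
# blocks until world_size participants have connected.
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "10.0.0.1"   # placeholder common address
os.environ["MASTER_PORT"] = "29500"      # placeholder common port

dist.init_process_group(
    backend="gloo",        # assumption; any supported backend works here
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```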

Experiments with PyTorch's RPC framework:

  1. If the connection breaks down for any reason, the entire init_process_group has to be restarted for all participants if we want to reconnect the failed worker. I have not researched a fix for this yet and do not know whether it is a blocker for this approach. @kexinrong @rajveerb @Kadle11
  2. The rest of the data flow works fine. I have started plugging this into the PyTorch source code.

I will also research exactly how init_process_group works, as @rajveerb suggested in this week's meeting.

Notes on the RPC Architecture:

  1. The asynchronicity of the callbacks is handled by PyTorch Futures. We can continue with the process after calling put_index. Once the RPC resolves the future with a result, it invokes the callback, which adds the result to the local_data_queue in the dataloader (see the sketch after this list).
  2. The data_loader will send the next index to a worker after it pops a result from its local_data_queue.
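A rough sketch of that put_index / callback flow, assuming an already-initialized RPC group; `put_index`, `process_index`, and `local_data_queue` mirror the names in the notes above, but the implementation details are illustrative only.

```python
# Illustrative only: the dataloader fires an async RPC per index and a
# Future callback pushes the completed result into local_data_queue;
# popping a result triggers sending the next index to that worker.
import queue
import torch
import torch.distributed.rpc as rpc

local_data_queue: "queue.Queue[torch.Tensor]" = queue.Queue()


def process_index(idx: int) -> torch.Tensor:
    return torch.zeros(3, 224, 224)  # stand-in for the worker's preprocessing


def put_index(worker: str, idx: int) -> None:
    fut = rpc.rpc_async(worker, process_index, args=(idx,))
    # The callback runs when the RPC completes and enqueues the result.
    fut.then(lambda f: local_data_queue.put(f.wait()))


# Main-loop sketch: pop a finished batch, then hand the worker its next index.
put_index("worker", 0)
for next_idx in range(1, 10):
    batch = local_data_queue.get()   # blocks until a result arrives
    put_index("worker", next_idx)
```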
rajveerb commented 1 year ago

Firstly, let's try to mimic the existing code's failure model, i.e., how it handles failures.

Secondly, the approach of restarting init_process_group is really bad. I am against patching the existing logic of init_process_group to work around this issue, in case it breaks init_process_group's existing behavior.

In the meantime, I would suggest going ahead and figuring out how init_process_group works.

tshimpi commented 1 year ago

Okay, I'll spend some time on both:

  1. The current code's failure model
  2. init_process_group failure behavior
tshimpi commented 1 year ago

I have integrated the RPC code into the PyTorch dataloader and workers following the above architecture. I am testing it tomorrow and will give a testing update in the meeting on Wednesday.

tshimpi commented 1 year ago

@rajveerb and I sat down to debug the RPC issue. We came to the conclusion that the issue lies with forking or spawning a worker from the main process: in that case, the main process ends up executing the stub RPC locally. When the worker and main processes are started separately, they seem to work fine.

We can try to integrate the container creation and RPC code and see if it works in that case.

rajveerb commented 1 year ago

@Kadle11

Please check the previous comment.