nlgranger closed this issue 9 months ago
Thanks for adding the issue. IIRC, RPC is used to communicate between distributed processes. We are working on ReadingService with DataLoader2; it seems RPC would be a good option as an RPCReadingService.
cc: @VitalyFedyunin
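Very roughly, such a service could own the RPC lifecycle and forward batch requests to remote workers. The sketch below is only an illustration: RPCReadingService and request_batch are made-up names, not existing torchdata APIs; the torch.distributed.rpc calls are the real API.

```python
# Rough sketch only: RPCReadingService is a hypothetical name, not an existing
# torchdata API; only the torch.distributed.rpc calls are real.
import torch.distributed.rpc as rpc


class RPCReadingService:
    """Hypothetical reading service owning the RPC lifecycle for DataLoader2."""

    def __init__(self, rank, world_size, worker_names):
        self.rank = rank
        self.world_size = world_size
        self.worker_names = worker_names  # e.g. ["worker1", "worker2"]

    def initialize(self):
        # Join the (single, global) RPC group; workers call init_rpc likewise.
        rpc.init_rpc("trainer", rank=self.rank, world_size=self.world_size)

    def request_batch(self, worker, fn, indices):
        # Dispatch one batch request to a remote worker; rpc_async would allow
        # several outstanding requests per worker.
        return rpc.rpc_sync(worker, fn, args=(indices,))

    def finalize(self):
        rpc.shutdown()
```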
I have pushed this idea further with a realistic scenario, and it seems that torch.rpc doesn't scale well for dataloader usage. I observed a lot of CPU overhead on both the worker and main-script sides. Besides, using rpc in the dataloader conflicts with other uses of rpc, since you can't initialize different groups simultaneously: there is only one global group.
Instead, I have re-implemented a simple RPC module on top of plain TCP sockets and made the whole thing a standalone library: https://github.com/CEA-LIST/RPCDataloader. In my use case (2 × 8 workers feeding 2 DDP models on another node) it works reliably, scales linearly with little overhead, and I am able to keep the GPUs busy.
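For reference, the core of the socket-based approach boils down to a length-prefixed pickle request/response loop. The snippet below is a simplified stand-in using only the standard library, not the actual RPCDataloader code:

```python
# Simplified stand-in for the RPC-over-TCP scheme (standard library only); this
# is not the actual RPCDataloader code, just the general request/response pattern.
import pickle
import socket
import struct


def send_msg(sock, obj):
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack("!Q", len(payload)) + payload)


def recv_msg(sock):
    (size,) = struct.unpack("!Q", _recv_exact(sock, 8))
    return pickle.loads(_recv_exact(sock, size))


def _recv_exact(sock, n):
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf.extend(chunk)
    return bytes(buf)


def serve(handlers, host="0.0.0.0", port=5000):
    """Worker side: answer (function name, args) requests from one client."""
    with socket.create_server((host, port)) as server:
        conn, _ = server.accept()
        with conn:
            while True:  # a ConnectionError ends the loop when the client disconnects
                func_name, args = recv_msg(conn)
                send_msg(conn, handlers[func_name](*args))


def request(sock, func_name, *args):
    """Client side: one synchronous round trip per batch request."""
    send_msg(sock, (func_name, args))
    return recv_msg(sock)
```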
Unless you see an opportunity to merge this into torchdata, I think it can stay as a separate library and we can close this issue.
🚀 The feature
A dataloader that communicates with its workers via the torch.distributed.rpc API.
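As a rough illustration of the idea (the helper names, process names, and the toy dataset below are made up for the example; only the torch.distributed.rpc calls are actual PyTorch APIs):

```python
# Minimal sketch of the proposed scheme; function names, process names and the
# toy dataset are illustrative only, the torch.distributed.rpc calls are real.
# MASTER_ADDR / MASTER_PORT must be set in the environment of every process.
import torch.distributed.rpc as rpc
from torch.utils.data import default_collate

dataset = list(range(1000))  # stand-in for a real Dataset built on the worker node


def load_batch(indices):
    """Runs on the remote worker: read, preprocess and collate the requested samples."""
    return default_collate([dataset[i] for i in indices])


def run_worker(rank, world_size):
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    rpc.shutdown()  # blocks until the trainer has finished issuing requests


def run_trainer(world_size):
    rpc.init_rpc("trainer", rank=0, world_size=world_size)
    # Only the sample indices travel to the worker; the collated batch comes back.
    batch = rpc.rpc_sync("worker1", load_batch, args=(list(range(32)),))
    print(batch.shape)  # torch.Size([32])
    rpc.shutdown()
```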
Motivation, pitch
Presently, process-based workers for DataLoader live on the same server/PC as the process consuming the data. This limits the CPU and memory available for data loading to that of a single machine.
Alternatives
No response
Additional context
A proof of concept is available here: https://github.com/CEA-LIST/RPCDataloader. I have not yet tested how efficient this is compared to communicating the preprocessed batch data via process pipes. Obviously, the use of shared memory is lost when the worker is remote, but the TensorPipe RPC backend might be able to take advantage of other fast transfer methods (GPUDirect, RDMA?).
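If that route were explored, the TensorPipe backend's device maps would be the relevant knob. A sketch, with placeholder names and untested performance:

```python
# Sketch of how the TensorPipe backend could keep batches on GPU end to end.
# Process names and device indices are placeholders; the options shown are
# documented PyTorch APIs, but I have not benchmarked this path.
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Map this worker's cuda:0 onto the trainer's cuda:0 so returned CUDA tensors are
# sent device to device; TensorPipe then picks the fastest available channel.
options.set_device_map("trainer", {0: 0})

rpc.init_rpc(
    "worker1",
    rank=1,
    world_size=2,
    rpc_backend_options=options,
)
```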
The load distribution scheme used in this first implementation is round-robin. I have not yet given much thought to how to make this configurable, both in terms of implementation and API.
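Concretely, the current dispatch amounts to cycling through the worker pool, something like the illustrative snippet below (not the library's exact code):

```python
# Illustrative round-robin dispatch over a pool of remote workers; not the
# library's exact implementation, just the scheme described above.
from itertools import cycle

workers = ["worker1", "worker2", "worker3"]
batches = [list(range(i * 4, (i + 1) * 4)) for i in range(6)]  # dummy index batches

assignment = cycle(workers)
for indices in batches:
    target = next(assignment)  # each request goes to the next worker in turn
    print(f"send {indices} to {target}")  # stand-in for the actual RPC/socket call
```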