nlgranger closed this issue 9 months ago
Thanks for adding the issue. IIRC, RPC is used to communicate between distributed processes. We are working on ReadingService with DataLoader2; it seems RPC would be a good option as an RPCReadingService.
cc: @VitalyFedyunin
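Very roughly, such a service could own the RPC lifecycle and forward batch requests to remote workers. The sketch below is only an illustration: RPCReadingService and request_batch are made-up names, not existing torchdata APIs; the torch.distributed.rpc calls are the real API.

```python
# Rough sketch only: RPCReadingService is a hypothetical name, not an existing
# torchdata API; only the torch.distributed.rpc calls are real.
import torch.distributed.rpc as rpc


class RPCReadingService:
    """Hypothetical reading service owning the RPC lifecycle for DataLoader2."""

    def __init__(self, rank, world_size, worker_names):
        self.rank = rank
        self.world_size = world_size
        self.worker_names = worker_names  # e.g. ["worker1", "worker2"]

    def initialize(self):
        # Join the (single, global) RPC group; workers call init_rpc likewise.
        rpc.init_rpc("trainer", rank=self.rank, world_size=self.world_size)

    def request_batch(self, worker, fn, indices):
        # Dispatch one batch request to a remote worker; rpc_async would allow
        # several outstanding requests per worker.
        return rpc.rpc_sync(worker, fn, args=(indices,))

    def finalize(self):
        rpc.shutdown()
```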
I have pushed this idea further with a realistic scenario, and it seems that torch.rpc doesn't scale well for dataloader usage. I observed a lot of CPU overhead on both the worker and main-script sides. Besides, using rpc in the dataloader conflicts with other uses of rpc, since you can't initialize different groups simultaneously: there is only one global group.
Instead, I have re-implemented a simple RPC module on top of plain TCP sockets and made the whole thing a standalone library: https://github.com/CEA-LIST/RPCDataloader. In my use case (2 × 8 workers feeding 2 DDP models on another node) it works reliably, scales linearly with little overhead, and I am able to keep the GPUs busy.
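For reference, the core of the socket-based approach boils down to a length-prefixed pickle request/response loop. The snippet below is a simplified stand-in using only the standard library, not the actual RPCDataloader code:

```python
# Simplified stand-in for the RPC-over-TCP scheme (standard library only); this
# is not the actual RPCDataloader code, just the general request/response pattern.
import pickle
import socket
import struct


def send_msg(sock, obj):
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack("!Q", len(payload)) + payload)


def recv_msg(sock):
    (size,) = struct.unpack("!Q", _recv_exact(sock, 8))
    return pickle.loads(_recv_exact(sock, size))


def _recv_exact(sock, n):
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf.extend(chunk)
    return bytes(buf)


def serve(handlers, host="0.0.0.0", port=5000):
    """Worker side: answer (function name, args) requests from one client."""
    with socket.create_server((host, port)) as server:
        conn, _ = server.accept()
        with conn:
            while True:  # a ConnectionError ends the loop when the client disconnects
                func_name, args = recv_msg(conn)
                send_msg(conn, handlers[func_name](*args))


def request(sock, func_name, *args):
    """Client side: one synchronous round trip per batch request."""
    send_msg(sock, (func_name, args))
    return recv_msg(sock)
```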
Unless you see an opportunity to merge this into torchdata, I think it can stay as a separate library and we can close this issue.
🚀 The feature
A dataloader that communicates with its workers via the torch.distributed.rpc API.
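As a rough illustration of the idea (the helper names, process names, and the toy dataset below are made up for the example; only the torch.distributed.rpc calls are actual PyTorch APIs):

```python
# Minimal sketch of the proposed scheme; function names, process names and the
# toy dataset are illustrative only, the torch.distributed.rpc calls are real.
# MASTER_ADDR / MASTER_PORT must be set in the environment of every process.
import torch.distributed.rpc as rpc
from torch.utils.data import default_collate

dataset = list(range(1000))  # stand-in for a real Dataset built on the worker node


def load_batch(indices):
    """Runs on the remote worker: read, preprocess and collate the requested samples."""
    return default_collate([dataset[i] for i in indices])


def run_worker(rank, world_size):
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    rpc.shutdown()  # blocks until the trainer has finished issuing requests


def run_trainer(world_size):
    rpc.init_rpc("trainer", rank=0, world_size=world_size)
    # Only the sample indices travel to the worker; the collated batch comes back.
    batch = rpc.rpc_sync("worker1", load_batch, args=(list(range(32)),))
    print(batch.shape)  # torch.Size([32])
    rpc.shutdown()
```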
Motivation, pitch
Presently, process-based workers for DataLoader live on the same server/PC as the process consuming the data. This limits the CPU and memory available for data loading to that of a single machine.
Alternatives
No response
Additional context
A proof of concept is available here: https://github.com/CEA-LIST/RPCDataloader. I have not yet tested how efficient this is compared to communicating the preprocessed batch data via process pipes. Obviously, the use of shared memory is lost when the worker is remote, but the TensorPipe RPC backend might be able to take advantage of other fast transfer methods (GPUDirect, RDMA?).
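If that route were explored, the TensorPipe backend's device maps would be the relevant knob. A sketch, with placeholder names and untested performance:

```python
# Sketch of how the TensorPipe backend could keep batches on GPU end to end.
# Process names and device indices are placeholders; the options shown are
# documented PyTorch APIs, but I have not benchmarked this path.
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Map this worker's cuda:0 onto the trainer's cuda:0 so returned CUDA tensors are
# sent device to device; TensorPipe then picks the fastest available channel.
options.set_device_map("trainer", {0: 0})

rpc.init_rpc(
    "worker1",
    rank=1,
    world_size=2,
    rpc_backend_options=options,
)
```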
The load distribution scheme used in this first implementation is round-robin. I have not yet given much thought to how to make this configurable, both in terms of implementation and API.
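Concretely, the current dispatch amounts to cycling through the worker pool, something like the illustrative snippet below (not the library's exact code):

```python
# Illustrative round-robin dispatch over a pool of remote workers; not the
# library's exact implementation, just the scheme described above.
from itertools import cycle

workers = ["worker1", "worker2", "worker3"]
batches = [list(range(i * 4, (i + 1) * 4)) for i in range(6)]  # dummy index batches

assignment = cycle(workers)
for indices in batches:
    target = next(assignment)  # each request goes to the next worker in turn
    print(f"send {indices} to {target}")  # stand-in for the actual RPC/socket call
```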