pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Distributed Data Parallel wrapper #296

Open wderekjones opened 5 years ago

wderekjones commented 5 years ago

I believe that a useful feature would be to implement a wrapper for the PyTorch DistributedDataParallel layer.

My personal motivation is to be able to use things like synchronized batch normalization across multiple GPUs, but I'm sure others would also find it useful when training models across multiple nodes in cluster environments.

rusty1s commented 5 years ago

The PyG DataParallel wrapper is closely related to the PyTorch one. Hence, you should be fine using the SyncBatchNorm recently introduced here. Training on one big graph across multiple nodes is something that is in the pipeline.
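Something along these lines should work (just a sketch, assuming one process per GPU, the NCCL backend, and that MASTER_ADDR/MASTER_PORT are set up elsewhere; the model layers and sizes are made up):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.nn import GCNConv


class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(16, 32)
        self.bn = torch.nn.BatchNorm1d(32)
        self.conv2 = GCNConv(32, 8)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.bn(x)
        return self.conv2(x, edge_index)


def build_model(rank, world_size):
    # One process per GPU.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    model = Net().to(rank)
    # Replace every BatchNorm layer with SyncBatchNorm so batch statistics
    # are reduced across all processes instead of computed per replica.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return DistributedDataParallel(model, device_ids=[rank])
```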

wderekjones commented 5 years ago

Right, I came across that feature as I was working on a solution to the problem of training on thousands of relatively large graphs (each of my GPUs has ~16GB of memory and can fit maybe 4 graphs at any point in time). The documentation points the user to the DistributedDataParallel layer rather than the DataParallel layer when using SyncBatchNorm. If you use BatchNormXd on its own (X = 1, 2, 3) and wrap it inside a DataParallel layer, the batch statistics are computed independently for each GPU, which is exactly why we need SyncBatchNorm to compute a single set of batch statistics across all devices. The problem is that there is no pytorch_geometric equivalent of the DistributedDataParallel layer as of now.

One workaround, given the current API, is to wrap only the pytorch_geometric-specific layers in the pytorch_geometric.nn.DataParallel layer and, once the outputs from those have been collected, pass them through the native PyTorch layers with non-synced batch norm on a single GPU. This can make the code awkward to write, and it breaks down if the collected output is still too large for a single GPU.
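Roughly what I have in mind (just a sketch; the module names and feature sizes are made up for illustration):

```python
import torch
from torch_geometric.nn import GCNConv, DataParallel, global_mean_pool


class GNNPart(torch.nn.Module):
    # Only the graph-specific layers: consumes a Batch and returns one
    # embedding per graph, so the gathered output stays small.
    def __init__(self):
        super().__init__()
        self.conv = GCNConv(16, 64)

    def forward(self, data):
        x = self.conv(data.x, data.edge_index).relu()
        return global_mean_pool(x, data.batch)


gnn = DataParallel(GNNPart()).to('cuda:0')   # scatters a list of Data objects
head = torch.nn.Sequential(                  # plain PyTorch layers on one GPU
    torch.nn.Linear(64, 32),
    torch.nn.BatchNorm1d(32),                # still un-synced, but single-device
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
).to('cuda:0')


def forward(data_list):
    # `data_list` comes from a DataListLoader, which keeps the examples as a
    # Python list instead of collating them into a single Batch.
    graph_emb = gnn(data_list)  # outputs are gathered back onto cuda:0
    return head(graph_emb)
```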

I'm not sure how straightforward it would be to subclass the existing DistributedDataParallel layer and implement the scatter_ function. In any case, I think this would be a useful feature.
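For illustration, a very rough sketch of what such a subclass could look like. None of these class names exist in pytorch_geometric; the splitting logic just mimics pytorch_geometric.nn.DataParallel, and it assumes the single-process, multi-device mode of DistributedDataParallel in which scatter actually gets called during forward:

```python
import torch
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.data import Batch


class DistributedGeoDataParallel(DistributedDataParallel):
    """Hypothetical subclass: scatter a list of Data objects graph-wise
    instead of splitting tensors along a batch dimension."""

    def scatter(self, inputs, kwargs, device_ids):
        data_list = inputs[0]
        # Naive round-robin split; torch_geometric.nn.DataParallel instead
        # balances the split by cumulative node count.
        chunks = [data_list[i::len(device_ids)] for i in range(len(device_ids))]
        inputs = [
            (Batch.from_data_list(chunk).to(torch.device('cuda', idx)),)
            for chunk, idx in zip(chunks, device_ids)
        ]
        # Keyword arguments are simply replicated per device in this sketch.
        kwargs = [kwargs or {}] * len(inputs)
        return inputs, kwargs
```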

rusty1s commented 5 years ago

I understand. Although opinions on the usefulness of synced batch norm vary, this is nonetheless a feature we should support. I will see what I can do, but please understand that this is not a top priority.

dblakely commented 5 years ago

> The PyG DataParallel wrapper is closely related to the PyTorch one. Hence, you should be fine using the SyncBatchNorm recently introduced here. Training on one big graph across multiple nodes is something that is in the pipeline.

By "in the pipeline," do you mean you're currently working on this or you're planning on it? Wondering because I've been working with some very large graphs and have been planning on implementing support for this on my own. Similarly, I've been working on GPU support for training models for large graphs by graph partitioning.

rusty1s commented 5 years ago

I plan to do it, but I would be very glad if you were interested in implementing it.

dblakely commented 5 years ago

I'm interested and can start working on this sometime soon.

josiahbjorgaard commented 2 years ago

@dblakely Did you have a chance to begin implementation of large graph and/or graph partitioning support?

dblakely commented 2 years ago

Sorry @josiahbjorgaard, I did not.