octree-nn / ocnn-pytorch

Octree-based Sparse Convolutional Neural Networks

Issue with DataParallel #7

Closed harryseely closed 1 year ago

harryseely commented 1 year ago

I am trying to increase the depth of my octree beyond 5, but I am running out of memory. As a workaround, I would like to use the PyTorch multi-GPU DataParallel wrapper. However, I am running into the following error when wrapping any OCNN model in nn.DataParallel:

AssertionError: The shape of input data is wrong.

DataParallel splits the input data across the GPUs (2 in my case). The error is traced to this line of code in octree_conv.py:

    check = tuple(data.shape) == self.in_shape

The tuple(data.shape) is (15646, 3) whereas self.in_shape is (31291, 3).

It appears that DataParallel is only half working: data.shape[0] * 2 = 31292, which is just 1 off from self.in_shape. So the data is being split across the GPUs, but self.in_shape is not updated to match.
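
To illustrate, a minimal torch-only probe (Probe is just a toy module I wrote for this, not part of ocnn) shows the same behaviour: tensor arguments are chunked along dim 0, one chunk per GPU, while non-tensor arguments reach every replica unchanged.

    import torch
    import torch.nn as nn

    class Probe(nn.Module):
        """Toy module showing how DataParallel scatters its inputs."""
        def forward(self, data, in_shape):
            # `data` (a tensor) arrives as one chunk per GPU;
            # `in_shape` (a plain tuple) arrives unchanged on every replica.
            print(tuple(data.shape), in_shape)
            # Return the per-replica chunk size so the gathered output shows the split.
            return data.new_full((1,), data.shape[0])

    data = torch.randn(31291, 3).cuda()
    probe = nn.DataParallel(Probe()).cuda()
    print(probe(data, (31291, 3)))
    # With 2 GPUs this prints something like (replica order may vary):
    #   (15646, 3) (31291, 3)
    #   (15645, 3) (31291, 3)
    #   tensor([15646., 15645.], device='cuda:0')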

Any idea what might be stopping DataParallel from working in this case? Could you test this?

Thanks!

wang-ps commented 1 year ago

DistributedDataParallel is supported. For example, run the following command to train with 4 GPUs:

    python classification.py --config configs/cls_m40.yaml SOLVER.gpu 0,1,2,3

harryseely commented 1 year ago

I need to use DataParallel because I am working on a Windows machine. I actually chose this O-CNN implementation because it is the only sparse-CNN implementation that does not require Linux (which Minkowski, Submanifold Sparse CNN, torchsparse, etc. all do).

wang-ps commented 1 year ago
  1. I think DistributedDataParallel can also be used on Windows machines.
  2. Currently, a batch of octrees is merged into one octree here. If you would like to use DataParallel, one solution is to delay the merging operation so that each of the two model replicas created by DataParallel gets half of the batch (see the sketch below).
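
Roughly, an untested sketch of that idea: a wrapper takes dense point/normal tensors of shape [B, N, 3] (which DataParallel can split along dim 0, unlike a merged octree) and builds and merges the octrees inside forward, so each replica merges only its own half of the batch. Treat the octree depths, the input feature, and the final model call as placeholders to adapt to your network, and double-check the helper signatures against your ocnn version; the samples must be padded or sampled to a fixed number of points N.

    import torch
    import torch.nn as nn
    import ocnn

    class DelayedMergeWrapper(nn.Module):
        # Untested sketch: build and merge the octrees inside forward() so that
        # each DataParallel replica only sees its half of the batch.
        def __init__(self, model, depth=6, full_depth=2):
            super().__init__()
            self.model = model                  # any ocnn network
            self.depth = depth
            self.full_depth = full_depth
            # Assumes 'ND' (normal + displacement) input features; adjust to your data.
            self.input_feature = ocnn.modules.InputFeature('ND', nempty=False)

        def forward(self, points, normals):
            # points, normals: dense tensors [B, N, 3] with a fixed N per sample,
            # so DataParallel can split them along dim 0.
            octrees = []
            for pts, nrm in zip(points, normals):
                octree = ocnn.octree.Octree(self.depth, self.full_depth, device=pts.device)
                octree.build_octree(ocnn.octree.Points(pts, nrm))
                octrees.append(octree)
            # Merge per replica instead of in the dataloader collate function.
            octree = ocnn.octree.merge_octrees(octrees)
            octree.construct_all_neigh()
            data = self.input_feature(octree)
            return self.model(data, octree, octree.depth)

    # usage (sketch): net = nn.DataParallel(DelayedMergeWrapper(ocnn_model)).cuda()
    #                 logits = net(points, normals)   # raw per-sample tensors, not a merged octree
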
harryseely commented 1 year ago

Ok this makes sense, thank you!