octree-nn / ocnn-pytorch

Octree-based Sparse Convolutional Neural Networks
MIT License

HRNet Issue #15

Closed: harryseely closed this issue 1 year ago

harryseely commented 1 year ago

Hello,

I am trying to use the HRNet adaptation with the following setup:

model = HRNet(dropout=0.5, in_channels=3, out_channels=4, stages=3, interp='linear', nempty=True)

I am also using an octree depth of 5 and a full depth of 2.
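
For context, here is a minimal sketch of roughly how I build the octree and run the forward pass (the HRNet import points at my local adaptation in ocnn_hrnet.py, and the point cloud and feature tensor are random stand-ins for the real inputs from my data loader):

import torch
from ocnn.octree import Points, Octree
from pytorch_models.ocnn_hrnet import HRNet  # my local HRNet adaptation

# Build an octree of depth 5 with full depth 2 from a random point cloud.
xyz = torch.rand(2048, 3) * 2 - 1                      # coordinates in [-1, 1]
points = Points(points=xyz, features=torch.rand(2048, 3))
octree = Octree(depth=5, full_depth=2)
octree.build_octree(points)
octree.construct_all_neigh()

# Stand-in for the 3-channel per-octant input features (non-empty octants only).
num_octants = int(octree.nnum_nempty[octree.depth])
features = torch.rand(num_octants, 3)

model = HRNet(dropout=0.5, in_channels=3, out_channels=4,
              stages=3, interp='linear', nempty=True)
pred = model(features, octree, octree.depth)           # classification logits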

I am running into the following error:

Traceback (most recent call last):
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "D:\Sync\DL_Development\Scripts\DL_Biomass\Pytorch\utils\train_dl.py", line 235, in train_model
    pred = forward_pass(model, batch, rank, cfg)
  File "D:\Sync\DL_Development\Scripts\DL_Biomass\Pytorch\utils\train_dl.py", line 147, in forward_pass
    pred = model(features, octree, octree.depth)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\parallel\distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Sync\DL_Development\Scripts\DL_Biomass\Pytorch\pytorch_models\ocnn_hrnet.py", line 192, in forward
    logits = self.cls_header(convs, octree, depth)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Sync\DL_Development\Scripts\DL_Biomass\Pytorch\pytorch_models\ocnn_hrnet.py", line 153, in forward
    logit = self.header(out)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\container.py", line 204, in forward
    input = module(input)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\ocnn\modules\modules.py", line 171, in forward
    out = self.fc(out)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\hseely\Miniconda3\envs\pytorch\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x1024 and 512x256)
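
From the last frame, it looks like the fully connected layer in the classification header was built for 512 input channels but is being fed a 1024-channel feature. The same error can be reproduced in isolation (a standalone sketch, not my actual code):

import torch

fc = torch.nn.Linear(512, 256)   # layer expecting 512-channel input
x = torch.rand(32, 1024)         # pooled feature with 1024 channels instead
fc(x)                            # RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x1024 and 512x256)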

I am using PyTorch DistributedDataParallel (DDP). I am able to run LeNet with no issues.
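
For reference, the DDP wrapping is the standard pattern (a minimal single-process sketch reusing the model, features, and octree from the snippet above; the real script spawns one process per GPU with torch.multiprocessing.spawn):

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group on CPU, just to illustrate the wrapping.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

ddp_model = DDP(model)                            # model, features, octree as above
pred = ddp_model(features, octree, octree.depth)
dist.destroy_process_group()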

Any idea what might be causing this?

Thanks!

-Harry

wang-ps commented 1 year ago

Please install the latest code and try again. I trained HRNet on 4 GPUs with DistributedDataParallel using the following command, and it works well:

python classification.py --config configs/cls_m40_hrnet.yaml SOLVER.alias time SOLVER.gpu 0,1,2,3

harryseely commented 1 year ago

After updating the package, it now works!