Multi-GPU error - Githubissues

sanfordhsu commented 5 years ago

Hey gsig, thanks for sharing this! However, when I tried the baseline experiment, i3d_mask_rcnn_ava.py, I got the following error:

RuntimeError: Assertion `THCTensor_(checkGPU)(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /opt/conda/conda-bld/pytorch-nightly_1553145032991/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:28

It looks like the code is not parallelizing properly. Could you please help me fix this bug?
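For anyone hitting the same assertion: one way to narrow it down is to compare the `.device` of the loss inputs right before the criterion is called. The helper below is only a sketch; `assert_same_device` and `devices_by_name` are hypothetical names, not part of this repository or of PyTorch, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def devices_by_name(**tensors):
    """Map each keyword name to the string form of that object's .device."""
    return {name: str(t.device) for name, t in tensors.items()}


def assert_same_device(**tensors):
    """Raise a readable error if the given objects live on different devices."""
    devices = devices_by_name(**tensors)
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"tensors on different devices: {devices}")
    return devices


# Stand-ins for real tensors (assumption: anything with a .device attribute).
output = SimpleNamespace(device="cuda:0")
target = SimpleNamespace(device="cuda:1")

try:
    assert_same_device(output=output, target=target)
except RuntimeError as err:
    message = str(err)
    print(message)
```

With real tensors, the same call would go just before the loss computation. If the target turns out to be on another GPU, moving it with `target = target.to(output.device)` is a common workaround, though whether that is the right fix here depends on where the mismatch comes from.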

gsig commented 5 years ago

What version of PyTorch are you using?

I'm using 1.0.0.dev20181130, and I was able to run the baseline with only a minor change to datasets/ava_mp4.py

I created a new branch for ava related updates, including the one I made to datasets/ava_mp4.py https://github.com/gsig/PyVideoResearch/tree/ava Feel free to send PR there!

My log for running the baseline (just the first few epochs) i3d_mask_rcnn_ava_debug_log.txt

The AVA I3D baseline is still very much a work in progress, so keep me updated on how it goes.

Best, Gunnar

sanfordhsu commented 5 years ago

Well, I tried the new branch and found a bug. I'm using version 1.0.0.dev20190326, and when I look at the PyTorch source code, there is no numpy_type_map component.

from torch.utils.data.dataloader import numpy_type_map
ImportError: cannot import name 'numpy_type_map'

Earlier, I changed the line 'from torch.utils.data.dataloader import numpy_type_map' to 'import torch.utils.data.dataloader' to avoid the bug, and that is when the multi-GPU bug I wrote about before emerged.

So how can I fix that...
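A quick way to confirm whether an installed build still exposes the name is sketched below. `exposes_name` is a hypothetical helper, not something from PyTorch or this repository. Note that swapping `from X import name` for `import X` only hides the ImportError: any later use of `X.name` would still fail with an AttributeError.

```python
import importlib


def exposes_name(module_name, attr):
    """Return True if module_name can be imported and defines attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing package and a missing submodule.
        return False
    return hasattr(module, attr)


# False here means code that relies on this name needs a different import
# or an older PyTorch build.
print(exposes_name("torch.utils.data.dataloader", "numpy_type_map"))
```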

Thanks, Sanford.

gsig commented 5 years ago

Where did you find 'numpy_type_map'? I cannot seem to find it in this repository.

Best, Gunnar

sanfordhsu commented 5 years ago

'numpy_type_map' is imported from the external repository maskrcnn-benchmark...

gsig commented 5 years ago

Thanks for the stack trace. It looks like the code is trying to import from the external/Detectron.pytorch repo, not the external/maskrcnn-benchmark repo. If everything is being run normally, external/Detectron.pytorch/lib should never have been added to the python path. The only place where it should be used is here: https://github.com/gsig/PyVideoResearch/blob/7dd2943dfc79dfbd72f1bf54d4cc9a9edf16ef3c/models/wrappers/maskrcnn_wrapper.py#L65 And this could be safely removed if you are not trying to visualize the bounding boxes.

The way that I am including external libraries can be seen here: https://github.com/gsig/PyVideoResearch/blob/7dd2943dfc79dfbd72f1bf54d4cc9a9edf16ef3c/models/wrappers/maskrcnn_wrapper.py#L22 The process is as follows:

1. I include the git repository as a git submodule under external/
2. To avoid importing modules with the same name, I remove any conflicts manually from sys.modules
3. I add the path to the repo to the Python path
4. I import the modules I need from that repo as normal
5. I remove the path from the Python path
6. Again, to avoid importing modules with the same name, I remove any conflicts manually from sys.modules
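The steps above can be sketched roughly as follows. This is not the code from maskrcnn_wrapper.py; `import_isolated` and the `fake_external_utils` module are hypothetical names, and a temporary directory stands in for a submodule under external/ so the snippet runs on its own.

```python
import importlib
import os
import sys
import tempfile


def import_isolated(repo_path, module_name, conflicts=()):
    """Import module_name from repo_path, then restore sys.path and
    drop the listed conflicting names from sys.modules."""
    # Drop cached modules whose names clash with the external repo.
    # Only exact names are removed; submodules must be listed explicitly.
    for name in conflicts:
        sys.modules.pop(name, None)
    sys.path.insert(0, repo_path)
    try:
        module = importlib.import_module(module_name)
    finally:
        # Remove the path again, even if the import failed.
        sys.path.remove(repo_path)
    # Drop the clashing names again so later imports resolve normally.
    for name in conflicts:
        sys.modules.pop(name, None)
    return module


# Demo: a throwaway directory plays the role of the external repository.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "fake_external_utils.py"), "w") as f:
        f.write("ANSWER = 42\n")
    mod = import_isolated(repo, "fake_external_utils",
                          conflicts=["fake_external_utils"])
    print(mod.ANSWER)
    print(repo in sys.path)
    print("fake_external_utils" in sys.modules)
```

The returned module object stays usable after its name is removed from sys.modules; the removal only stops a later `import` of the same name from picking up the external copy.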

If importing utils.vis from Detectron.pytorch works when run in isolation, then there is something in sys.path or sys.modules that's not supposed to be there.
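A small diagnostic along these lines might help spot the stray entry. Both helper names are hypothetical, and the module names passed in are guesses based on this thread.

```python
import sys


def path_entries_containing(fragment):
    """List sys.path entries whose text mentions fragment."""
    return [p for p in sys.path if fragment in str(p)]


def module_origin(name):
    """Return the file a cached module was loaded from, or None if the
    module is not cached or has no __file__ attribute."""
    module = sys.modules.get(name)
    return getattr(module, "__file__", None)


# Run this right before the failing import.
print(path_entries_containing("Detectron.pytorch"))
print(module_origin("utils"))
print(module_origin("utils.vis"))
```

A non-empty list, or an origin that points inside external/Detectron.pytorch, would suggest that the path or a cached module was left behind by an earlier import.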

In any case, I pushed a commit to the branch that removes the visualization code, so external/Detectron.pytorch should never be used.

Let me know if that helps.

gsig / PyVideoResearch

Multi-GPU error #10