NimaDL opened this issue 4 years ago
Hey, I haven't tried torch 1.4. Posting the error trace would be helpful. You'll need to update the number of GPUs in the config, and make sure the batch size is a multiple of the number of GPUs.
Traceback (most recent call last):
  File "tools/train_net.py", line 196, in ...
[...]
RuntimeError: [...] You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb6b1f9d273 in /home/nima/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator [...]
Have a look at defaults.py, IMS_PER_BATCH (it should be a multiple of NGPUs).
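For context, IMS_PER_BATCH is the global batch size and gets split evenly across the training processes, so it must divide cleanly by the number of GPUs. A minimal sketch of that constraint (the helper name `images_per_gpu` is illustrative, not the repo's actual code):

```python
def images_per_gpu(ims_per_batch: int, num_gpus: int) -> int:
    # The global batch size (SOLVER.IMS_PER_BATCH in defaults.py) is split
    # evenly across processes, so it has to be divisible by the GPU count.
    assert ims_per_batch % num_gpus == 0, (
        f"IMS_PER_BATCH ({ims_per_batch}) must be divisible by NGPUs ({num_gpus})"
    )
    return ims_per_batch // num_gpus


print(images_per_gpu(16, 2))  # -> 8 images per GPU
```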
As for the error you posted, I haven't seen that before. It might have to do with different torch versions; back then I used torch 1.1.
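One thing worth trying is the first suggestion in the error message itself: enabling unused-parameter detection where the model is wrapped for distributed training (in maskrcnn-benchmark-style code that wrap lives in tools/train_net.py). A minimal sketch, assuming the process group is already initialized; `wrap_model_for_ddp` and its arguments are illustrative, not the repo's actual code:

```python
import torch.nn as nn

def wrap_model_for_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    # Assumes torch.distributed.init_process_group(...) has already run.
    return nn.parallel.DistributedDataParallel(
        model.cuda(local_rank),
        device_ids=[local_rank],     # GPU owned by this process
        output_device=local_rank,
        # Lets the reducer tolerate parameters that receive no gradient
        # in a given forward pass, as the error message suggests.
        find_unused_parameters=True,
    )
```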
@mrlooi Thanks. And what torchvision version?
I believe 0.4 - 1.1
@mrlooi The error when running on my own dataset: RuntimeError: invalid argument 2: non-empty 3D or 4D input tensor expected but got: [0 x 1 x 28 x 28] at /pytorch/aten/src/THCUNN/generic/SpatialDilatedMaxPooling.cu:37.
IMS_PER_BATCH = 16, NGPU = 2. Changing IMS_PER_BATCH to 2, 4, or 8 gives the same issue.
This is a separate issue. See https://github.com/mrlooi/rotated_maskrcnn/issues/21#issuecomment-600703061
@mrlooi thank you for your great work.
Running the code on multiple GPUs gives this error: subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train_net.py', '--local_rank=1', '--config-file', 'configs/rotated/my_config.yaml']' returned non-zero exit status 1.