Closed DEQDON closed 3 years ago
Hi,
That is strange. We have never encountered this issue before. Could you try running the training without cudnn, to check if there is some issue with cudnn? I think you can disable it by adding the following line to the run_training function in run_training.py:
torch.backends.cudnn.enabled = False
@goutamgmb Sure, I'll do that right now.
@goutamgmb I put that line into run_training.py, and the main function becomes this:
if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    torch.backends.cudnn.enabled = False
    main()
Again, the program broke at batch number 204, but this time with a different error message:
[train: 1, 199 / 2000] FPS: 15.8 (18.4) , Loss/total: 7.45128 , Loss/bb_ce: 4.58791 , ClfTrain/clf_ce: 4.73917
[train: 1, 200 / 2000] FPS: 15.8 (19.1) , Loss/total: 7.44909 , Loss/bb_ce: 4.58672 , ClfTrain/clf_ce: 4.73645
[train: 1, 201 / 2000] FPS: 15.8 (18.6) , Loss/total: 7.44151 , Loss/bb_ce: 4.58608 , ClfTrain/clf_ce: 4.72930
[train: 1, 202 / 2000] FPS: 15.9 (18.8) , Loss/total: 7.43589 , Loss/bb_ce: 4.58731 , ClfTrain/clf_ce: 4.72343
[train: 1, 203 / 2000] FPS: 15.9 (18.9) , Loss/total: 7.43436 , Loss/bb_ce: 4.58744 , ClfTrain/clf_ce: 4.72062
[train: 1, 204 / 2000] FPS: 15.9 (18.9) , Loss/total: 7.42971 , Loss/bb_ce: 4.58731 , ClfTrain/clf_ce: 4.71585
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/generic/THCTensorMath.cu line=16 error=77 : an illegal memory access was encountered
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
File "../ltr/trainers/base_trainer.py", line 70, in train
self.train_epoch()
File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
self.cycle_dataset(loader)
File "../ltr/trainers/ltr_trainer.py", line 61, in cycle_dataset
loss, stats = self.actor(data)
File "../ltr/actors/tracking.py", line 95, in __call__
test_proposals=data['test_proposals'])
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../ltr/models/tracking/dimpnet.py", line 66, in forward
iou_pred = self.bb_regressor(train_feat_iou, test_feat_iou, train_bb, test_proposals)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "../ltr/models/bbreg/atom_iou_net.py", line 86, in forward
modulation = self.get_modulation(feat1, bb1)
File "../ltr/models/bbreg/atom_iou_net.py", line 162, in get_modulation
fc3_r = self.fc3_1r(roi3r)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/generic/THCTensorMath.cu:16
and
Restarting training from last epoch ...
No matching checkpoint file found
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
File "../ltr/trainers/base_trainer.py", line 70, in train
self.train_epoch()
File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
self.cycle_dataset(loader)
File "../ltr/trainers/ltr_trainer.py", line 55, in cycle_dataset
data = data.to(self.device)
File "../pytracking/libs/tensordict.py", line 24, in apply_attr
return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
File "../pytracking/libs/tensordict.py", line 24, in <dictcomp>
return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
RuntimeError: CUDA error: an illegal memory access was encountered
My major concern now is that the pytorch version and the cuda version are not compatible. May I ask which versions you are running?
I found something new. I changed some of the "cuda:0" code yesterday because I needed to specify a gpu card. Now when I run gpustat -i, I see that the program takes up memory on 2 cards. I guess I made some mistake when changing the card number in the code. Could that be the cause of this error?
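For illustration, a minimal sketch (hypothetical module and tensor names, not code from this repository) of what a leftover hard-coded "cuda:0" can do once the rest of the code targets another card. Built-in ops usually report the mismatch cleanly, but custom CUDA extensions (such as the PrRoIPool op used by the IoU network) do not always validate tensor devices, and the mismatch can then surface later as an illegal memory access:

import torch
import torch.nn as nn

# The card the code was edited to use:
device = torch.device('cuda:1')
net = nn.Linear(512, 256).to(device)          # parameters now live on cuda:1

# ...while a hard-coded "cuda:0" left behind elsewhere still creates inputs
# on the other card:
feat = torch.randn(4, 512, device='cuda:0')

# Built-in ops would raise a clear device-mismatch error for net(feat); a custom
# CUDA kernel fed tensors from two devices may instead corrupt memory and show
# up later as "an illegal memory access was encountered".

# Fix: derive every .to()/device argument from the single `device` variable.
feat = feat.to(device)
out = net(feat)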
That's possible. I would suggest running with CUDA_VISIBLE_DEVICES (e.g. CUDA_VISIBLE_DEVICES=0,1 python run_training ....) in case you want to run on specific GPUs.
I usually use PyTorch 1.2 or 1.4, with cuda version 10.2. However, I doubt that the error you have is due to a version mismatch.
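If the GPU selection has to live in Python rather than on the command line, a minimal sketch of the in-code equivalent, assuming it runs before anything initializes CUDA (e.g. at the very top of run_training.py):

import os

# Equivalent of the shell prefix; it must be set before torch (or anything
# else) initializes CUDA, otherwise all GPUs are already visible to the process.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch
print(torch.cuda.device_count())  # now reports only the selected cards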
Problem solved. This was the issue. I now specify the GPU in bash with CUDA_VISIBLE_DEVICES=0,1 rather than in the code.
Thank you for your help and your wonderful code.
Hi, I'm new to deep learning and pytorch. I'm trying to run PrDiMP training with a resnet18 backbone. My machine runs CentOS 7, with cuda 10.2, cudnn 7.6.5, and gcc version 7.3.1. I'm only training on the Got10k dataset, and ltr/admin/local.py and ltr/train_settings/dimp/prdimp18.py are modified to fit my dataset.
I ran the install.sh script to set up the environment, except that ninja-build was installed manually because CentOS does not use apt-get for installing libraries. The environment works for the PrDiMP tracking task (tested on Got10k as well). However, when running training, after 204 batches, the program broke with
and
Also, when the program started, there was a warning about the C++ version. However, the same C++ warning appears when I run tracking, and tracking works anyway.
Since I followed the install.sh script, the environment that was set up automatically is (output of conda list -n pytracking):
I'm not sure if the installed pytorch 1.4.0 and torchvision 0.5.0 are the recommended versions. Or does the pytorch 1.4.0 py3.7_cuda10.0.130_cudnn7.6.3_0 build conflict with my cuda 10.2 and cudnn 7.6.5? Any help would be appreciated. Thanks!
I also tried to reduce batch_size from 26 to 8, and samples_per_epoch from 26000 to 16000, so the total batch number changes from 1000 to 2000. But still, it broke at batch number 204.
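For reference, a rough sketch of the kind of edit described in the previous paragraph, as it might look in ltr/train_settings/dimp/prdimp18.py; the attribute and variable names are assumed from that settings file and may differ slightly:

# Hypothetical, trimmed excerpt of ltr/train_settings/dimp/prdimp18.py showing
# only the two values that were changed; everything else in run() stays as-is.
def run(settings):
    settings.batch_size = 8        # reduced from 26

    # samples_per_epoch is an argument of the training sampler built later in
    # this function. With 16000 samples and batch_size 8 the loader runs
    # 16000 / 8 = 2000 batches per epoch (previously 26000 / 26 = 1000),
    # which matches the per-epoch batch count printed in the training log.
    samples_per_epoch = 16000      # reduced from 26000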