visionml / pytracking

Visual tracking library based on PyTorch.
GNU General Public License v3.0

PrDiMP training "RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR" #221

Closed. DEQDON closed this issue 3 years ago.

DEQDON commented 3 years ago

Hi, I'm new to deep learning and PyTorch. I'm trying to run PrDiMP training with the ResNet-18 backbone. My machine runs CentOS 7, with CUDA 10.2, cuDNN 7.6.5, and gcc 7.3.1. I'm training only on the Got10k dataset, and I modified ltr/admin/local.py and ltr/train_settings/dimp/prdimp18.py to fit my setup.
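
For context, my ltr/admin/local.py is essentially the default template with only the Got10k path filled in, roughly like this (paths are placeholders and the attribute names follow the template the install script generates, so treat this as a sketch rather than my exact file):

class EnvironmentSettings:
    def __init__(self):
        self.workspace_dir = '/path/to/workspace'    # checkpoints and logs are written here
        self.tensorboard_dir = self.workspace_dir + '/tensorboard/'
        self.got10k_dir = '/path/to/got10k/train'    # only Got10k is used for training
        # all other dataset paths are left at their empty defaults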

I ran the install.sh script to set up the environment, except that ninja-build was installed manually because CentOS does not use apt-get for installing packages.
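
(For anyone else on CentOS, the manual step was roughly the following; whether the yum package is available depends on having the EPEL repository, and pip works as a fallback inside the conda environment:)

sudo yum install -y ninja-build    # EPEL package, replaces the apt-get install in install.sh
pip install ninja                  # alternative: install it inside the pytracking conda env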

The environment works for the PrDiMP tracking task (tested on Got10k as well). However, when I run training, the program crashes after 204 batches with

[train: 1, 200 / 2600] FPS: 17.0 (46.3)  ,  Loss/total: 7.24056  ,  Loss/bb_ce: 4.56396  ,  ClfTrain/clf_ce: 4.41025
[train: 1, 201 / 2600] FPS: 17.1 (47.1)  ,  Loss/total: 7.23772  ,  Loss/bb_ce: 4.56450  ,  ClfTrain/clf_ce: 4.40706
[train: 1, 202 / 2600] FPS: 16.9 (5.0)  ,  Loss/total: 7.23695  ,  Loss/bb_ce: 4.56430  ,  ClfTrain/clf_ce: 4.40547
[train: 1, 203 / 2600] FPS: 16.8 (7.6)  ,  Loss/total: 7.23343  ,  Loss/bb_ce: 4.56405  ,  ClfTrain/clf_ce: 4.40103
[train: 1, 204 / 2600] FPS: 16.8 (47.0)  ,  Loss/total: 7.23036  ,  Loss/bb_ce: 4.56352  ,  ClfTrain/clf_ce: 4.39753
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
    self.cycle_dataset(loader)
  File "../ltr/trainers/ltr_trainer.py", line 61, in cycle_dataset
    loss, stats = self.actor(data)
  File "../ltr/actors/tracking.py", line 95, in __call__
    test_proposals=data['test_proposals'])
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/tracking/dimpnet.py", line 66, in forward
    iou_pred = self.bb_regressor(train_feat_iou, test_feat_iou, train_bb, test_proposals)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/bbreg/atom_iou_net.py", line 86, in forward
    modulation = self.get_modulation(feat1, bb1)
  File "../ltr/models/bbreg/atom_iou_net.py", line 162, in get_modulation
    fc3_r = self.fc3_1r(roi3r)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

and

Restarting training from last epoch ...
No matching checkpoint file found
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
    self.cycle_dataset(loader)
  File "../ltr/trainers/ltr_trainer.py", line 55, in cycle_dataset
    data = data.to(self.device)
  File "../pytracking/libs/tensordict.py", line 24, in apply_attr
    return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
  File "../pytracking/libs/tensordict.py", line 24, in <dictcomp>
    return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
RuntimeError: CUDA error: an illegal memory access was encountered

Also, when the program started, there was a warning about the C++ compiler.

No matching checkpoint file found
Using /tmp/torch_extensions as PyTorch extensions root...
/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/utils/cpp_extension.py:191: UserWarning:

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  platform=sys.platform))
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/torch_extensions/_prroi_pooling/build.ninja...
Building extension module _prroi_pooling...
ninja: no work to do.
Loading extension module _prroi_pooling...
[train: 1, 1 / 2600] FPS: 0.6 (0.6)  ,  Loss/total: 6.41806  ,  Loss/bb_ce: 4.67661  ,  ClfTrain/clf_ce: 3.84514
[train: 1, 2 / 2600] FPS: 1.1 (48.4)  ,  Loss/total: 6.44306  ,  Loss/bb_ce: 4.55203  ,  ClfTrain/clf_ce: 3.85286
[train: 1, 3 / 2600] FPS: 1.6 (46.3)  ,  Loss/total: 6.29484  ,  Loss/bb_ce: 4.54702  ,  ClfTrain/clf_ce: 3.72006

However, this C++ warning also appears when I run tracking, and tracking works fine anyway.

Since I followed the install.sh script, the environment it set up automatically is (output of conda list -n pytracking):

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
absl-py                   0.11.0                   pypi_0    pypi
blas                      1.0                         mkl
ca-certificates           2020.10.14                    0
cachetools                4.1.1                    pypi_0    pypi
certifi                   2020.11.8        py37h06a4308_0
cffi                      1.14.4                   pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
cudatoolkit               10.0.130                      0
cycler                    0.10.0                   py37_0
cython                    0.29.21          py37h2531618_0
dbus                      1.13.18              hb2f20db_0
decorator                 4.4.2                    pypi_0    pypi
expat                     2.2.10               he6710b0_2
filelock                  3.0.12                   pypi_0    pypi
fontconfig                2.13.0               h9420a91_0
freetype                  2.10.4               h5ab3b9f_0
gdown                     3.12.2                   pypi_0    pypi
glib                      2.66.1               h92f7085_0
google-auth               1.23.0                   pypi_0    pypi
google-auth-oauthlib      0.4.2                    pypi_0    pypi
grpcio                    1.33.2                   pypi_0    pypi
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb31296c_0
icu                       58.2                 he6710b0_3
idna                      2.10                     pypi_0    pypi
imageio                   2.9.0                    pypi_0    pypi
importlib-metadata        3.1.1                    pypi_0    pypi
intel-openmp              2020.2                      254
jpeg                      9b                   h024ee3a_2
jpeg4py                   0.1.4                    pypi_0    pypi
jsonpatch                 1.28                     pypi_0    pypi
jsonpointer               2.0                      pypi_0    pypi
kiwisolver                1.3.0            py37h2531618_0
lcms2                     2.11                 h396b838_0
ld_impl_linux-64          2.33.1               h53a641e_7
libedit                   3.1.20191231         h14c3975_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.1.0                hdf63c60_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_1
libuuid                   1.0.3                h1bed415_2
libxcb                    1.14                 h7b6447c_0
libxml2                   2.9.10               hb55368b_3
lvis                      0.5.3                    pypi_0    pypi
lz4-c                     1.9.2                heb0550a_3
markdown                  3.3.3                    pypi_0    pypi
matplotlib                3.3.2                         0
matplotlib-base           3.3.2            py37h817c723_0
mkl                       2020.2                      256
mkl-service               2.3.0            py37he904b0f_0
mkl_fft                   1.2.0            py37h23d657b_0
mkl_random                1.1.1            py37h0573a6f_0
ncurses                   6.2                  he6710b0_1
networkx                  2.5                      pypi_0    pypi
ninja                     1.10.2           py37hff7bd54_0
numpy                     1.19.2           py37h54aff64_0
numpy-base                1.19.2           py37hfa32c7d_0
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46                     py37_0
opencv-python             4.4.0.46                 pypi_0    pypi
openssl                   1.1.1h               h7b6447c_0
pandas                    1.1.3            py37he6710b0_0
pcre                      8.44                 he6710b0_0
pillow                    8.0.1            py37he98fc37_0
pip                       20.3             py37h06a4308_0
protobuf                  3.14.0                   pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycocotools               2.0.2                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
pyparsing                 2.4.7                      py_0
pyqt                      5.9.2            py37h05f1152_2
pysocks                   1.7.1                    pypi_0    pypi
python                    3.7.9                h7579374_0
python-dateutil           2.8.1                      py_0
pytorch                   1.4.0           py3.7_cuda10.0.130_cudnn7.6.3_0    pytorch
pytz                      2020.4             pyhd3eb1b0_0
pywavelets                1.1.1                    pypi_0    pypi
pyzmq                     20.0.0                   pypi_0    pypi
qt                        5.9.7                h5867ecd_1
readline                  8.0                  h7b6447c_0
requests                  2.25.0                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.6                      pypi_0    pypi
scikit-image              0.17.2                   pypi_0    pypi
scipy                     1.5.4                    pypi_0    pypi
setuptools                50.3.1           py37h06a4308_1
sip                       4.19.8           py37hf484d3e_0
six                       1.15.0           py37h06a4308_0
sqlite                    3.33.0               h62c20be_0
tb-nightly                2.5.0a20201202           pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tifffile                  2020.11.26               pypi_0    pypi
tikzplotlib               0.9.6                    pypi_0    pypi
tk                        8.6.10               hbc83047_0
torchfile                 0.1.0                    pypi_0    pypi
torchvision               0.5.0                py37_cu100    pytorch
tornado                   6.0.4            py37h7b6447c_1
tqdm                      4.51.0             pyhd3eb1b0_0
urllib3                   1.26.2                   pypi_0    pypi
visdom                    0.1.8.9                  pypi_0    pypi
websocket-client          0.57.0                   pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.35.1             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
zipp                      3.4.0                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.5                h9ceee32_0

I'm not sure whether the installed pytorch 1.4.0 and torchvision 0.5.0 are the recommended versions, or whether pytorch 1.4.0 py3.7_cuda10.0.130_cudnn7.6.3_0 conflicts with my system CUDA 10.2 and cuDNN 7.6.5. Any help would be appreciated. Thanks!
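
As a side note, a quick way to check which CUDA/cuDNN versions this PyTorch build itself reports is via the standard torch attributes (run inside the pytracking env; shown here as a sketch, not pasted output):

import torch

print(torch.__version__)               # installed PyTorch version
print(torch.version.cuda)              # CUDA toolkit the build was compiled against (10.0 per the build string above)
print(torch.backends.cudnn.version())  # cuDNN version bundled with this build
print(torch.cuda.is_available())       # whether the local driver can run this build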


I also tried reducing batch_size from 26 to 8 and samples_per_epoch from 26000 to 16000, so the total number of batches per epoch changed from 1000 to 2000. Still, it broke at batch number 204:

[train: 1, 200 / 2000] FPS: 18.3 (42.1)  ,  Loss/total: 7.37861  ,  Loss/bb_ce: 4.53875  ,  ClfTrain/clf_ce: 4.65750
[train: 1, 201 / 2000] FPS: 17.9 (3.4)  ,  Loss/total: 7.37244  ,  Loss/bb_ce: 4.53805  ,  ClfTrain/clf_ce: 4.65128
[train: 1, 202 / 2000] FPS: 17.9 (42.1)  ,  Loss/total: 7.36871  ,  Loss/bb_ce: 4.53778  ,  ClfTrain/clf_ce: 4.64706
[train: 1, 203 / 2000] FPS: 18.0 (41.8)  ,  Loss/total: 7.36853  ,  Loss/bb_ce: 4.53980  ,  ClfTrain/clf_ce: 4.64609
[train: 1, 204 / 2000] FPS: 18.0 (41.7)  ,  Loss/total: 7.36477  ,  Loss/bb_ce: 4.54119  ,  ClfTrain/clf_ce: 4.64213
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
    self.cycle_dataset(loader)
  File "../ltr/trainers/ltr_trainer.py", line 61, in cycle_dataset
    loss, stats = self.actor(data)
  File "../ltr/actors/tracking.py", line 95, in __call__
    test_proposals=data['test_proposals'])
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/tracking/dimpnet.py", line 66, in forward
    iou_pred = self.bb_regressor(train_feat_iou, test_feat_iou, train_bb, test_proposals)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/bbreg/atom_iou_net.py", line 86, in forward
    modulation = self.get_modulation(feat1, bb1)
  File "../ltr/models/bbreg/atom_iou_net.py", line 162, in get_modulation
    fc3_r = self.fc3_1r(roi3r)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
goutamgmb commented 3 years ago

Hi,

That is strange; we have never encountered this issue before. Could you try running the training without cuDNN, to check whether cuDNN itself is the problem? I think you can disable it by adding the following line to the run_training function in run_training.py:

torch.backends.cudnn.enabled = False
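
For instance, roughly like this (the run_training signature below is only a sketch; the important part is setting the flag before the trainer is created):

def run_training(train_module, train_name):
    # Disable cuDNN so convolutions fall back to the native CUDA kernels.
    torch.backends.cudnn.enabled = False
    ...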

DEQDON commented 3 years ago

@goutamgmb Sure, I'll do that right now.

DEQDON commented 3 years ago

@goutamgmb

I put that line into run_training.py, and the __main__ block now looks like this:

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn', force=True)
    torch.backends.cudnn.enabled = False
    main()

Again, the program broke at batch number 204, but this time with a different error:

[train: 1, 199 / 2000] FPS: 15.8 (18.4)  ,  Loss/total: 7.45128  ,  Loss/bb_ce: 4.58791  ,  ClfTrain/clf_ce: 4.73917
[train: 1, 200 / 2000] FPS: 15.8 (19.1)  ,  Loss/total: 7.44909  ,  Loss/bb_ce: 4.58672  ,  ClfTrain/clf_ce: 4.73645
[train: 1, 201 / 2000] FPS: 15.8 (18.6)  ,  Loss/total: 7.44151  ,  Loss/bb_ce: 4.58608  ,  ClfTrain/clf_ce: 4.72930
[train: 1, 202 / 2000] FPS: 15.9 (18.8)  ,  Loss/total: 7.43589  ,  Loss/bb_ce: 4.58731  ,  ClfTrain/clf_ce: 4.72343
[train: 1, 203 / 2000] FPS: 15.9 (18.9)  ,  Loss/total: 7.43436  ,  Loss/bb_ce: 4.58744  ,  ClfTrain/clf_ce: 4.72062
[train: 1, 204 / 2000] FPS: 15.9 (18.9)  ,  Loss/total: 7.42971  ,  Loss/bb_ce: 4.58731  ,  ClfTrain/clf_ce: 4.71585
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/generic/THCTensorMath.cu line=16 error=77 : an illegal memory access was encountered
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
    self.cycle_dataset(loader)
  File "../ltr/trainers/ltr_trainer.py", line 61, in cycle_dataset
    loss, stats = self.actor(data)
  File "../ltr/actors/tracking.py", line 95, in __call__
    test_proposals=data['test_proposals'])
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/tracking/dimpnet.py", line 66, in forward
    iou_pred = self.bb_regressor(train_feat_iou, test_feat_iou, train_bb, test_proposals)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "../ltr/models/bbreg/atom_iou_net.py", line 86, in forward
    modulation = self.get_modulation(feat1, bb1)
  File "../ltr/models/bbreg/atom_iou_net.py", line 162, in get_modulation
    fc3_r = self.fc3_1r(roi3r)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/xxx/anaconda3/envs/pytracking/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/generic/THCTensorMath.cu:16

and

Restarting training from last epoch ...
No matching checkpoint file found
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "../ltr/trainers/ltr_trainer.py", line 80, in train_epoch
    self.cycle_dataset(loader)
  File "../ltr/trainers/ltr_trainer.py", line 55, in cycle_dataset
    data = data.to(self.device)
  File "../pytracking/libs/tensordict.py", line 24, in apply_attr
    return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
  File "../pytracking/libs/tensordict.py", line 24, in <dictcomp>
    return TensorDict({n: getattr(e, name)(*args, **kwargs) if hasattr(e, name) else e for n, e in self.items()})
RuntimeError: CUDA error: an illegal memory access was encountered
DEQDON commented 3 years ago

My main concern now is that the PyTorch version and the CUDA version are not compatible. May I ask which versions you are running?

DEQDON commented 3 years ago

I found something new. I changed some of the "cuda:0" code yesterday because I needed to specify a particular GPU. Now when I run gpustat -i, I see that the program takes up memory on two cards. I guess I made a mistake when changing the device index in the code. Could that be what caused this error?
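
A quick sanity check for this (plain PyTorch calls, nothing pytracking-specific) would be to see which device unqualified 'cuda' tensors actually land on:

import torch

print(torch.cuda.device_count())     # how many GPUs this process can see
print(torch.cuda.current_device())   # index of the currently selected GPU
x = torch.zeros(1, device='cuda')
print(x.device)                      # device that plain 'cuda' tensors are allocated on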

goutamgmb commented 3 years ago

That's possible. I would suggest running with CUDA_VISIBLE_DEVICES (e.g. CUDA_VISIBLE_DEVICES=0,1 python run_training ....) if you want to run on specific GPUs.
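
As a sketch of what that could look like for this training run (the dimp prdimp18 arguments are assumed from the settings file mentioned above):

CUDA_VISIBLE_DEVICES=0 python run_training.py dimp prdimp18    # expose only GPU 0 to the process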

I usually use PyTorch 1.2 or 1.4 with CUDA 10.2. However, I doubt that the error you are seeing is due to a version mismatch.

DEQDON commented 3 years ago

Problem solved. This was indeed the issue. I now specify the GPUs in bash with CUDA_VISIBLE_DEVICES=0,1 rather than in the code.

Thank you for your help and your wonderful code.