kenziyuliu / MS-G3D

[CVPR 2020 Oral] PyTorch implementation of "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition"
https://arxiv.org/abs/2003.14111
MIT License

CUDA out of memory while evaluating pretrained #38

Closed fspegni closed 3 years ago

fspegni commented 3 years ago

Hi,

first of all, thanks for sharing your project. I'm trying to replicate your steps, but I get stuck due to low available GPU memory (my GPU has about 4 GB; is that not enough?).

I followed the instructions but get an error at the bash eval_pretrained.sh step, apparently because torch tries to allocate too much memory at once (it asks for a chunk of >800 MB; see the log below). Is there any way to work around this problem?

(nn) spegni@locanda:~/git/neural-networks/MS-G3D$ bash eval_pretrained.sh 
/home/spegni/git/neural-networks/MS-G3D/main.py:687: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  default_arg = yaml.load(f)
[ Wed May 12 16:34:08 2021 ] Model total number of params: 3194595
Cannot parse global_step from model weights filename
[ Wed May 12 16:34:08 2021 ] Loading weights from pretrained-models/ntu60-xsub-joint-fusion.pt
[ Wed May 12 16:34:08 2021 ] Model:   model.msg3d.Model
[ Wed May 12 16:34:08 2021 ] Weights: pretrained-models/ntu60-xsub-joint-fusion.pt
[ Wed May 12 16:34:08 2021 ] Eval epoch: 1
  0%|                                                   | 0/516 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/spegni/git/neural-networks/MS-G3D/main.py", line 702, in <module>
    main()
  File "/home/spegni/git/neural-networks/MS-G3D/main.py", line 698, in main
    processor.start()
  File "/home/spegni/git/neural-networks/MS-G3D/main.py", line 660, in start
    self.eval(
  File "/home/spegni/git/neural-networks/MS-G3D/main.py", line 580, in eval
    output = self.model(data)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/git/neural-networks/MS-G3D/model/msg3d.py", line 160, in forward
    x = F.relu(self.sgcn1(x) + self.gcn3d1(x), inplace=True)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/git/neural-networks/MS-G3D/model/msg3d.py", line 100, in forward
    out_sum += gcn3d(x)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/git/neural-networks/MS-G3D/model/msg3d.py", line 61, in forward
    x = self.gcn3d(x)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/git/neural-networks/MS-G3D/model/ms_gtcn.py", line 106, in forward
    out = self.mlp(agg)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/git/neural-networks/MS-G3D/model/mlp.py", line 23, in forward
    x = layer(x)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 135, in forward
    return F.batch_norm(
  File "/home/spegni/.virtualenvs/nn/lib/python3.9/site-packages/torch/nn/functional.py", line 2149, in batch_norm
    return torch.batch_norm(
RuntimeError: CUDA out of memory. Tried to allocate 880.00 MiB (GPU 0; 3.82 GiB total capacity; 1.41 GiB already allocated; 434.81 MiB free; 1.82 GiB reserved in total by PyTorch)

NTU RGB+D 60 XSub
Traceback (most recent call last):
  File "/home/spegni/git/neural-networks/MS-G3D/ensemble.py", line 31, in <module>
    with open(os.path.join(arg.joint_dir, 'epoch1_test_score.pkl'), 'rb') as r1:
FileNotFoundError: [Errno 2] No such file or directory: 'pretrain_eval/ntu60/xsub/joint-fusion/epoch1_test_score.pkl'
fspegni commented 3 years ago

While profiling the memory usage (also following the suggestions at https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch), I added some logging to the following functions:

Here is the log file: ms-g3d.log

The last invocation of the forward function is on the following layer:

DEBUG:root:Layer: BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
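One generic way to obtain this kind of "last layer before the crash" log is to wrap each layer so it is recorded before its forward call runs; if the call then raises (e.g. a CUDA OOM), the final log entry points at the offending layer. A minimal pure-Python sketch of the pattern (the real code would wrap the torch.nn modules; `batch_norm` here is a hypothetical stand-in):

```python
class LoggingWrapper:
    """Wraps a callable layer and records it before each call.

    If the wrapped call raises (e.g. CUDA out of memory), the last
    recorded entry identifies the layer that was running.
    """
    def __init__(self, layer, log):
        self.layer = layer
        self.log = log  # shared list acting as the log sink

    def __call__(self, x):
        # Record first, then delegate, so a crash leaves a trace
        self.log.append(f"Layer: {self.layer.__name__}")
        return self.layer(x)

# Hypothetical stand-in for a real network layer
def batch_norm(x):
    return x

log = []
wrapped = LoggingWrapper(batch_norm, log)
wrapped(1.0)
print(log[-1])  # last layer invoked before any failure
```

In PyTorch itself, `module.register_forward_pre_hook` achieves the same effect without wrapping each layer by hand.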

In the initial ticket I forgot to mention the relevant parameters of my system:

kenziyuliu commented 3 years ago

Hi @fspegni,

Thanks a lot for your interest! If I recall correctly, 4 GB of memory should be well enough for just testing the pretrained models. One quick thing you can try is reducing test_batch_size in the corresponding config files (only training should be sensitive to the batch size).
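Reducing the batch size shrinks activation memory roughly linearly, since each sample's activations are independent. A back-of-the-envelope check (the shapes here are illustrative for NTU-style skeleton input, not taken from the repo's configs):

```python
def tensor_bytes(*shape, dtype_bytes=4):
    """Bytes needed for a dense tensor of the given shape (float32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

# Illustrative NTU skeleton input: (batch, channels, frames, joints, bodies)
def input_bytes(batch):
    return tensor_bytes(batch, 3, 300, 25, 2)

# Halving the batch halves the input (and, roughly, the activation) memory
print(input_bytes(32) / 2**20, "MiB for batch 32")
print(input_bytes(16) / 2**20, "MiB for batch 16")
```

The intermediate activations inside the network are far larger than the raw input (e.g. the >800 MiB chunk in the traceback above), but they scale with the batch size in the same linear way, which is why a smaller test_batch_size avoids the OOM.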

fspegni commented 3 years ago

Thanks, I was able to get past this by adding --test-batch-size N (with N=2 or N=16, on two different platforms with different GPUs) when invoking the main.py script inside the eval_pretrained.sh script.

fspegni commented 3 years ago

Since I was able to run the tests by adjusting that parameter, I'm closing the issue. Thanks for helping!