open-mmlab / mmskeleton

An OpenMMLab toolbox for human pose estimation, skeleton-based action recognition, and action synthesis.
Apache License 2.0
2.94k stars 1.04k forks

GPU problem #364

Open saniazahan opened 4 years ago

saniazahan commented 4 years ago

Hi, it seems your code for training the model uses 4 GPUs, but I have only one. I tried using one, but it shows an error during backpropagation. Is there a way to train on a single GPU device? Thanks.

saniazahan commented 4 years ago

I am getting this error:

total_feat_size 800
[09.25.20|00:13:48] Parameters: {'work_dir': 'work_dir', 'config': 'config/st_gcn/ntu-xview/train.yaml', 'phase': 'train', 'save_result': False, 'start_epoch': 0, 'num_epoch': 80, 'use_gpu': True, 'device': [0], 'log_interval': 100, 'save_interval': 10, 'eval_interval': 5, 'save_log': True, 'print_log': True, 'pavi_log': False, 'feeder': 'feeder.feeder.Feeder', 'num_worker': 4, 'train_feeder_args': {'data_path': './data/NTU-RGB-D/xview/train_data.npy', 'label_path': './data/NTU-RGB-D/xview/train_label.pkl', 'debug': False}, 'test_feeder_args': {'data_path': './data/NTU-RGB-D/xview/val_data.npy', 'label_path': './data/NTU-RGB-D/xview/val_label.pkl'}, 'batch_size': 64, 'test_batch_size': 64, 'debug': False, 'model': 'net.st_gcn.Model', 'model_args': {'in_channels': 3, 'num_class': 60, 'dropout': 0.5, 'edge_importance_weighting': True, 'graph_args': {'layout': 'ntu-rgb+d', 'strategy': 'spatial'}}, 'weights': None, 'ignore_weights': [], 'show_topk': [1, 5], 'base_lr': 0.1, 'step': [10, 50], 'optimizer': 'SGD', 'nesterov': True, 'weight_decay': 0.0001}

[09.25.20|00:13:48] Training epoch: 0 /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed. 
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed. 
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [23,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed. 
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed. /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed. 
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    p.start()
  File "~/st-gcn/processor/processor.py", line 113, in start
    self.train()
  File "~/phd_codes/st-gcn/processor/recognition.py", line 119, in train
    loss.backward()
  File "~/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "~/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Exception raised from createCublasHandle at /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f56c6ecd77d in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xcec005 (0x7f56c801f005 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xb75 (0x7f56c801fee5 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xcdf097 (0x7f56c8012097 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar, c10::Scalar) + 0xf7e (0x7f56c9376ace in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0xb3 (0x7f56c93785c3 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xd04d20 (0x7f56c8037d20 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x7b1990 (0x7f56fddd9990 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f56fe5c1c7c in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::mm(at::Tensor const&, at::Tensor const&) + 0x4b (0x7f56fe512b0b in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x2c2be8f (0x7f5700253e8f in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x7b1990 (0x7f56fddd9990 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f56fe5c1c7c in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor::mm(at::Tensor const&) const + 0x4b (0x7f56fe6a810b in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2a6d094 (0x7f5700095094 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::AddmmBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x2d5 (0x7f570009b055 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x30d1017 (0x7f57006f9017 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0x1400 (0x7f57006f4860 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x451 (0x7f57006f5401 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x89 (0x7f57006ed579 in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x4a (0x7f5704a171ba in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #21: + 0xc819d (0x7f570751519d in /home/uniwa/students3/students/22905553/linux/anaconda3/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #22: + 0x76db (0x7f57293966db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x3f (0x7f57290bfa3f in /lib/x86_64-linux-gnu/libc.so.6)

JiaMingLin commented 2 years ago

Maybe you packed the *.skeleton files into .npy with 120 classes rather than 60 classes. That was the cause in my case.
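
A quick way to check this is to look for labels outside the range the model expects, since the assertion `t >= 0 && t < n_classes` in the log fires when a target label is negative or not smaller than the number of classes (`num_class` is 60 in the posted parameters). The snippet below is only a sketch: the helper names are made up for illustration, and it assumes the label `.pkl` stores a `(sample_names, labels)` tuple, which may not match your copy of the data tools.

```python
import pickle
from collections import Counter


def out_of_range_labels(labels, num_class):
    """Count labels that fall outside the valid range [0, num_class)."""
    return Counter(int(t) for t in labels if not 0 <= int(t) < num_class)


def check_label_file(label_path, num_class):
    """Return (number of labels, Counter of out-of-range labels) for a label file.

    Assumption: the pickle holds a (sample_names, labels) tuple.
    """
    with open(label_path, "rb") as f:
        _sample_names, labels = pickle.load(f)
    return len(labels), out_of_range_labels(labels, num_class)


# Small in-memory demo: with 60 classes, labels 60 and 119 are invalid.
demo_labels = [0, 59, 60, 119]
print(out_of_range_labels(demo_labels, num_class=60))
```

To check real data, call `check_label_file` with the `label_path` from the log (for example `./data/NTU-RGB-D/xview/train_label.pkl`) and `num_class=60`. If the returned counter is not empty, either regenerate the data with 60 classes or set `num_class` in the config to match the labels.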