jeonsworld / ViT-pytorch

Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License
1.95k stars 374 forks

Errors when using custom data to retrain the ViT transformer #17

Open superxiaoying opened 3 years ago

superxiaoying commented 3 years ago

When using my custom dataset, which contains 6 classes, I modified data_utils.py and set num_classes = 6 in train.py.
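For reference, a sketch of the edit, assuming the upstream `setup()` in `train.py` (which hard-codes `num_classes` per dataset); the `6` is my class count:

```python
# train.py (sketch of the edit; imports as in the upstream script)
import numpy as np
from models.modeling import VisionTransformer, CONFIGS

def setup(args):
    config = CONFIGS[args.model_type]

    # originally: num_classes = 10 if args.dataset == "cifar10" else 100
    num_classes = 6  # my custom dataset has 6 classes

    model = VisionTransformer(config, args.img_size,
                              zero_head=True, num_classes=num_classes)
    model.load_from(np.load(args.pretrained_dir))
    model.to(args.device)
    return args, model
```

With that change, training fails with the errors below: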

```
Training (X / X Steps) (loss=X.X): 0%|| 0/33 [00:00<?, ?it/s]
/opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
Training (X / X Steps) (loss=X.X): 0%|| 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_trash.py", line 335, in <module>
    main()
  File "train_trash.py", line 331, in main
    train(args, model)
  File "train_trash.py", line 211, in train
    loss.backward()
  File "/root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f533ff7077d in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcfc185 (0x7f53410d2185 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xb75 (0x7f53410d3065 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xcef217 (0x7f53410c5217 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar, c10::Scalar) + 0xf7e (0x7f534242985e in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0xb3 (0x7f534242b353 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd14ea0 (0x7f53410eaea0 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x7b1990 (0x7f5372b9b990 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f5373383c7c in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::mm(at::Tensor const&, at::Tensor const&) + 0x4b (0x7f53732d4b0b in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2c2be8f (0x7f5375015e8f in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x7b1990 (0x7f5372b9b990 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f5373383c7c in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor::mm(at::Tensor const&) const + 0x4b (0x7f537346a10b in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2a6d094 (0x7f5374e57094 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::AddmmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x2d5 (0x7f5374e5d055 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x30d1017 (0x7f53754bb017 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr const&) + 0x1400 (0x7f53754b6860 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr const&) + 0x451 (0x7f53754b7401 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr const&, bool) + 0x89 (0x7f53754af579 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr const&, bool) + 0x4a (0x7f53797de13a in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #21: <unknown function> + 0xc819d (0x7f537c30f19d in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #22: <unknown function> + 0x76db (0x7f53a0e6c6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x3f (0x7f53a01e8a3f in /lib/x86_64-linux-gnu/libc.so.6)
```

I guess this error is caused by the labels going out of range: the assertion `t >= 0 && t < n_classes` means some target label falls outside [0, 6), which would happen, for example, if the labels are numbered 1..6 instead of 0..5. But I can't find where to fix it. Could you please help me solve this problem?
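In case it helps, here is a minimal sanity check I could run to confirm the guess (a sketch; `check_labels` is a hypothetical helper, and `get_loader` refers to this repo's `utils/data_utils.py`). The trailing `CUBLAS_STATUS_ALLOC_FAILED` is likely just an asynchronous follow-on of the device-side assert; running with `CUDA_LAUNCH_BLOCKING=1` should surface the assertion as the reported error.

```python
from torch.utils.data import DataLoader

def check_labels(loader: DataLoader, num_classes: int):
    """Collect every target value outside the valid range [0, num_classes)."""
    bad = set()
    for _, labels in loader:                       # (images, labels) batches
        mask = (labels < 0) | (labels >= num_classes)
        bad.update(labels[mask].tolist())
    return sorted(bad)

# Hypothetical usage with this repo's loaders:
#   train_loader, test_loader = get_loader(args)
#   print(check_labels(train_loader, 6))   # expect [] for a 6-class head
```

If this prints, say, `[6]`, the labels are 1-based and would need `labels - 1` (or an equivalent `target_transform`) before they reach the loss.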

Thank you!