liangheming / cascade_rcnn

PyTorch implementation of Cascade R-CNN, 736px (max side), 41.2 mAP (COCO), 21.94 fps (RTX 2080 Ti)
MIT License

/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed. #3

Open · fancy-chenyao opened this issue 1 year ago

fancy-chenyao commented 1 year ago

Hello, I ran into this error while debugging your code. I am using my own dataset; could you tell me where the problem is? The full error output is shown below. Looking forward to your reply!

```
/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
  0%|          | 0/1125 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    processor.run()
  File "/root/cascade-rcnn/solver/ddp_mix_solver.py", line 213, in run
    self.train(epoch)
  File "/root/cascade-rcnn/solver/ddp_mix_solver.py", line 113, in train
    targets={"target": targets_tensor, "batch_len": batch_len})
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 702, in forward
    box_predicts, cls_predicts, roi_losses = self.cascade_head(feature_dict, boxes, valid_size, targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 621, in forward
    boxes, cls, loss = self.roi_heads[i](feature_dict, boxes, valid_size, targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 590, in forward
    cls_loss, box_loss = self.compute_loss(proposals, cls_predicts, box_predicts, targets)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 564, in compute_loss
    cls_loss = self.ce(loss_cls_predicts, loss_cls_targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2085) of binary: /root/miniconda3/envs/my-env/bin/python
```
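The assertion `t >= 0 && t < n_classes` in nll_loss means the class indices reaching `cross_entropy` fall outside the range the classification head was built for, which is a common symptom when a custom dataset's label ids or class count do not match the training config. Below is a minimal sketch for narrowing this down; `num_foreground_classes` and the assumption that the class id sits in a particular column of the targets tensor are both placeholders, since the exact packing depends on your own data pipeline:

```python
import torch

# Hypothetical stand-ins: adjust to however your dataloader packs targets.
# Here we assume each target row looks like [batch_idx, label, x1, y1, x2, y2]
# with the class id in column 1 -- check your own dataset class.
num_foreground_classes = 20          # whatever class count you set in the config
targets_tensor = torch.tensor([      # dummy batch of labels/boxes
    [0, 0, 10, 10, 50, 50],
    [0, 19, 20, 20, 80, 80],
])

labels = targets_tensor[:, 1].long()
print("label min/max:", labels.min().item(), labels.max().item())

# cross_entropy over a head with C outputs only accepts targets in [0, C-1].
# If the head reserves an extra index for background, foreground labels must
# leave room for it; labels that start at 1 or exceed C trigger the assert.
assert labels.min() >= 0, "negative class index in the dataset"
assert labels.max() < num_foreground_classes, (
    "a label is >= the configured class count -- either the dataset labels "
    "are 1-based or the class count in the config is too small"
)
```

Running a check like this over the whole training set, and comparing the result with the class-count field in your config, usually tells you whether the mismatch is in the annotations or in the configuration.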

yuanfangshang888 commented 1 year ago

My guess is that the author trained with distributed training while you are training on a single GPU, and that is why this problem shows up; I say that because I noticed ERROR:torch.distributed.elastic.multiprocessing.api:failed in your log. It is only a guess, but you could try looking into it from that angle when you have time.
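Whatever the launcher situation, the `device-side assert triggered` message itself is generic: CUDA kernels run asynchronously, so the Python traceback can point away from the operation that actually failed. A quick way to get a readable error (a generic PyTorch sketch, not specific to this repo) is to force synchronous kernel launches and, if possible, replay the failing `cross_entropy` call on CPU, where out-of-range targets raise a plain index error:

```python
import os
# Must be set before CUDA is initialized (e.g. at the very top of main.py)
# so the error surfaces at the call that launched the failing kernel.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torch.nn.functional as F

# Dummy CPU reproduction: a target equal to n_classes is out of range.
logits = torch.randn(4, 3)                 # 3-way classifier
bad_targets = torch.tensor([0, 1, 2, 3])   # 3 is invalid for 3 classes
try:
    F.cross_entropy(logits, bad_targets)
except (IndexError, RuntimeError) as e:
    print("CPU gives a readable error:", e)
```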