Closed. U-zzd closed this issue 1 year ago.
It is a CUDA error. Please use mmcls.apis.init_model to debug.
How to debug it:
from mmcls.apis import inference_model, init_model

config = '/YOUR/CONFIG/PATH'  # path to your config file
img = './demo/demo.JPEG'
# checkpoint=None builds the model from the config without loading weights
model = init_model(config, None, device='cpu')
result = inference_model(model, img)
If it raises an error, try feeding a dummy tensor to the model directly:
from mmcls.apis import init_model
import torch

config = '/YOUR/CONFIG/PATH'  # path to your config file
img = torch.randn((1, 3, 224, 224))  # dummy input instead of a real image
model = init_model(config, None, device='cpu')
result = model(img, return_loss=False)
If the second method does not report an error, the model itself is fine, so check the outputs of your dataloader: verify the tensor shapes and that the labels are in the expected range.
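For example, here is a minimal sketch of such a check. It assumes your config defines data.train in the usual 0.x layout and relies on the base dataset's get_gt_labels() helper; NUM_CLASSES and the config path are placeholders you need to adjust.

# Sketch: confirm every ground-truth label lies inside [0, NUM_CLASSES).
from mmcv import Config
from mmcls.datasets import build_dataset

NUM_CLASSES = 4  # placeholder: set to the num_classes of your head
cfg = Config.fromfile('/YOUR/CONFIG/PATH')
dataset = build_dataset(cfg.data.train)

labels = dataset.get_gt_labels()
bad = [(i, int(l)) for i, l in enumerate(labels) if l < 0 or l >= NUM_CLASSES]
print(f'{len(bad)} samples with out-of-range labels:', bad[:10])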
I just started using this and still don't know how to debug it.
Branch
master branch (0.24 or other 0.x version)
Describe the bug
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [32,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
main()
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/tools/train.py", line 193, in main
train_model(
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/apis/train.py", line 233, in train_model
runner.run(data_loaders, cfg.workflow)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/classifiers/base.py", line 139, in train_step
losses = self(**data)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
return old_func(*args, **kwargs)
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/classifiers/base.py", line 83, in forward
return self.forward_train(img, **kwargs)
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/classifiers/image.py", line 139, in forward_train
img, gt_label = self.augments(img, gt_label)
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/utils/augment/augments.py", line 72, in call
return aug(img, gt_label)
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/utils/augment/mixup.py", line 80, in call
return self.mixup(img, gt_label)
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/mmcls/models/utils/augment/mixup.py", line 73, in mixup
mixed_img = lam * img + (1 - lam) * img[index, :]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
fd = df.detach()
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/connection.py", line 513, in Client
answer_challenge(c, authkey)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/connection.py", line 762, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/connection.py", line 221, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/home/zhao/anaconda3/envs/MMLab/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [34,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
[... the same assertion repeats for many other threads ...]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/THC/THCCachingHostAllocator.cpp line=280 error=710 : device-side assert triggered
Traceback (most recent call last):
File "/home/zhao/AI_zhao/MMLab/mmclassification-master/tools/train.py", line 205, in
Other information
Binary classification works normally, but switching to four-class classification raises this error.
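Given that the assertion fires inside the mixup one-hot encoding, one plausible cause is that num_classes in the config was left at 2 while the dataset now has 4 classes, so labels 2 and 3 fall outside the scatter index range. Below is a hypothetical config excerpt, not the author's actual config, showing the fields that would need to agree with the dataset in a 0.x-style setup (the ResNet-18 backbone, LinearClsHead, and BatchMixup settings are assumptions for illustration).

# Hypothetical 0.x-style excerpt: num_classes must match the dataset in both places.
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ResNet', depth=18, num_stages=4, out_indices=(3, ), style='pytorch'),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=4,  # was 2 for the binary setup
        in_channels=512,
        # mixup produces soft labels, so the loss must accept them
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
    train_cfg=dict(
        augments=dict(type='BatchMixup', alpha=0.2, num_classes=4, prob=1.0)))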