uni-medical / SAM-Med3D

SAM-Med3D: An Efficient General-purpose Promptable Segmentation Model for 3D Volumetric Medical Image
Apache License 2.0

Errors that occur during training #68

Open shenshaowei opened 3 months ago

shenshaowei commented 3 months ago

(sam3d) a@a-Super-Server:/media/a/DATA/ssw-baselines/SAM-Med3D$ python3 train.py
Loaded checkpoint from ckpt/sam_med3d.pth (epoch 0)
Epoch: 0/199
  0%| | 0/150 [00:00<?, ?it/s]/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/nn/modules/conv.py:605: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv3d(
  1%|▌ | 1/150 [00:05<14:11, 5.72s/it]
Traceback (most recent call last):
  File "/media/a/DATA/ssw-baselines/SAM-Med3D/train.py", line 520, in <module>
    main()
  File "/media/a/DATA/ssw-baselines/SAM-Med3D/train.py", line 479, in main
    trainer.train()
  File "/media/a/DATA/ssw-baselines/SAM-Med3D/train.py", line 374, in train
    epoch_loss, epoch_iou, epoch_dice, pred_list = self.train_epoch(epoch, num_clicks)
  File "/media/a/DATA/ssw-baselines/SAM-Med3D/train.py", line 294, in train_epoch
    for step, (image3D, gt3D) in enumerate(tbar):
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/prefetch_generator/__init__.py", line 116, in __next__
    raise next_item
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/prefetch_generator/__init__.py", line 98, in run
    for item in self.generator: self.queue.put((True , item))
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
    return self._process_data(data)
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index) # type: ignore[possibly-undefined]
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 316, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 173, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 173, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 141, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/a/miniconda3/envs/sam3d/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 213, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: torch.cat(): input types can't be cast to the desired output type Int

I've been stuck on this for a long time. Is there a solution?

shenshaowei commented 3 months ago

Here is a screenshot of the error.

RRouhi commented 3 months ago

I got the same error: "RuntimeError: torch.cat(): input types can't be cast to the desired output type Int". Setting --batch_size 1 solved the issue.

tuan-ld commented 2 months ago

> I got the same error: "RuntimeError: torch.cat(): input types can't be cast to the desired output type Int". Setting --batch_size 1 solved the issue.

Thank you, it works!
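
The thread does not say why --batch_size 1 helps. One plausible reading of the traceback (an assumption, not confirmed by the maintainers) is that the samples in a batch do not all share one dtype, so torch.stack inside default_collate cannot write them into a single output tensor. With one sample per batch there is nothing to mismatch. If that is the cause, a larger batch size might still work with a custom collate_fn that casts every sample before stacking. The sketch below is only illustrative: ToyVolumes and cast_collate are hypothetical names, the dtypes are chosen for the example, and none of this is code from the SAM-Med3D repository.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyVolumes(Dataset):
    """Hypothetical stand-in dataset whose labels come back with mixed dtypes."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        image = torch.rand(1, 8, 8, 8)
        # Assumption for illustration: some labels load as int32, others as float32.
        label_dtype = torch.int32 if idx % 2 == 0 else torch.float32
        label = torch.ones(1, 8, 8, 8, dtype=label_dtype)
        return image, label


def cast_collate(batch):
    """Cast each sample to a fixed dtype, then stack along a new batch axis."""
    images = torch.stack([img.to(torch.float32) for img, _ in batch], dim=0)
    labels = torch.stack([lbl.to(torch.int64) for _, lbl in batch], dim=0)
    return images, labels


# num_workers is left at its default of 0 so the sketch runs in a single process.
loader = DataLoader(ToyVolumes(), batch_size=2, collate_fn=cast_collate)
images, labels = next(iter(loader))
print(images.dtype, labels.dtype, tuple(labels.shape))
```

To try the same idea in a real run, the collate function would be passed wherever the training script constructs its DataLoader. Casting inside the dataset's own __getitem__ would achieve the same effect, and printing each sample's dtype first is a quick way to check whether the assumption holds at all.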