ibrahimethemhamamci / CT2Rep

MICCAI 2024 & CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging

Multi-GPU training error #14


brentonlin commented 2 weeks ago
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 18697.44it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 18428.40it/s]
/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
images:  torch.Size([4, 1, 240, 480, 480])
torch.Size([1, 512, 20, 20, 20])
test
torch.Size([1, 112, 512])
Traceback (most recent call last):
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/main.py", line 118, in <module>
    main()
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/main.py", line 114, in main
    trainer.train()
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/modules/trainer.py", line 55, in train
    result = self._train_epoch(epoch)
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/modules/trainer.py", line 169, in _train_epoch
    output = self.model(images, reports_ids, mode='train')
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/models/ct2rep.py", line 38, in forward_ct2rep
    att_feats, fc_feats = self.visual_extractor(images)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/home/mumulin/CT2Rep/CT2Rep/modules/visual_extractor.py", line 13, in forward
    patch_feats = self.model(images, return_encoded_tokens=True)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/home/mumulin/CT2Rep/ctvit/ctvit/ctvit.py", line 524, in forward
    tokens = self.to_patch_emb(video)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/home/mumulin/.conda/envs/ct2rep/lib/python3.10/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Hi author, I have a problem with multi-GPU training. Do you know how to fix it, or can you tell me what I did wrong? The command I used is "python main.py --max_seq_length 300 --threshold 10 --epochs 100 --save_dir results/our_data/ --step_size 1 --gamma 0.8 --batch_size 4 --d_vf 512 --n_gpu 4"
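For anyone debugging this: the traceback shows a classic `nn.DataParallel` failure mode. During `parallel_apply`, the model replica running on cuda:1 reaches a `LayerNorm` whose `weight` still lives on cuda:0, so `torch.layer_norm` sees input and weight on different devices. This usually happens when a tensor inside the wrapped module is a plain attribute rather than a registered `Parameter` or buffer (or when a submodule is pinned to a fixed device at load time), so DataParallel's replication cannot move it along with the rest of the module. A minimal CPU-runnable sketch of the anti-pattern and the fix (`BuggyNorm`/`FixedNorm` are illustrative names, not CT2Rep code):

```python
import torch
import torch.nn as nn

class BuggyNorm(nn.Module):
    """Anti-pattern: the weight is a plain tensor attribute, so it is
    invisible to .parameters(), to .to(device), and to DataParallel's
    replication. Under DataParallel the replica on cuda:1 would still
    reference this cuda:0 tensor, triggering the 'found at least two
    devices' RuntimeError seen in the traceback."""
    def __init__(self, dim, device="cpu"):
        super().__init__()
        self.weight = torch.ones(dim, device=device)  # NOT registered

    def forward(self, x):
        return x * self.weight

class FixedNorm(nn.Module):
    """Fix: register the weight as an nn.Parameter, so model.to(device)
    and DataParallel's replication move it together with the module."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * self.weight

# The unregistered tensor does not show up as a parameter, which is
# exactly why replication leaves it behind on the original device.
print(len(list(BuggyNorm(8).parameters())))  # 0
print(len(list(FixedNorm(8).parameters())))  # 1

x = torch.randn(2, 8)
out = FixedNorm(8)(x)
print(out.shape)  # torch.Size([2, 8])
```

As a practical workaround while debugging, restricting training to a single device (e.g. `CUDA_VISIBLE_DEVICES=0 python main.py ... --n_gpu 1`) sidesteps replication entirely; otherwise it is worth checking that the CTViT weights are loaded onto the same device the model is moved to before it is wrapped in `DataParallel`.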

hari3100 commented 2 weeks ago

Hi @brentonlin, I hope you are doing well. I'm commenting in the hope that you have already run the CT2Rep model on your own system. Could you help me with inference for this model? I also have a few questions about the input size, etc. The main thing I'm looking for is how to run the model: I'm new to medical imaging and don't fully understand the code already present in the repo. Could you share some inference code you wrote, or guide me on how to do it? Any feedback or guidance would be greatly appreciated. Thank you for your time!

Andyfever123 commented 2 weeks ago

@brentonlin I also encountered this problem. Have you solved it? Could you please share your solution with me? Thanks a lot.