google-research / big_transfer

Official repository for the "Big Transfer (BiT): General Visual Representation Learning" paper.
https://arxiv.org/abs/1912.11370
Apache License 2.0

module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1 #73

[Open] RuojiWang opened this issue 2 years ago

RuojiWang commented 2 years ago

I ran into the following error:

```
ssh://root@10.4.208.56:10001/root/anaconda3/bin/python3.8 -u /workspace/project/huafeng/big_transfer-master/train.py --name huaweishengteng --model BiT-M-R50x1 --logdir ./logs --dataset cifar10 --datadir ./cifar
2022-09-14 11:42:29,707 [INFO] bit_common: Namespace(base_lr=0.003, batch=512, batch_split=1, bit_pretrained_dir='.', datadir='./cifar', dataset='cifar10', eval_every=None, examples_per_class=None, examples_per_class_seed=0, logdir='./logs', model='BiT-M-R50x1', name='huaweishengteng', save=True, workers=8)
2022-09-14 11:42:29,707 [INFO] bit_common: Going to train on cuda:1
Files already downloaded and verified
Files already downloaded and verified
2022-09-14 11:42:31,373 [INFO] bit_common: Using a training set with 50000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Using a validation set with 10000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Loading model from BiT-M-R50x1.npz
2022-09-14 11:42:31,893 [INFO] bit_common: Moving model onto all GPUs
2022-09-14 11:42:31,908 [INFO] bit_common: Model will be saved in './logs/huaweishengteng/bit.pth.tar'
2022-09-14 11:42:31,908 [INFO] bit_common: Fine-tuning from BiT
2022-09-14 11:42:34,812 [INFO] bit_common: Starting training!
Traceback (most recent call last):
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 296, in <module>
    main(parser.parse_args())
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 237, in main
    logits = model(x)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Process finished with exit code 1
```

cuda:0 is occupied by others in my lab, so I can only use the other GPUs, e.g. cuda:7. How can I solve this problem?
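
For context: `torch.nn.DataParallel` defaults `device_ids` to all visible GPUs, and its forward pass requires the wrapped module's parameters and buffers to already be on `device_ids[0]`, i.e. cuda:0. Since train.py moved the model to cuda:1, that check fails. A minimal sketch of the failure mode (hypothetical toy module, needs a machine with at least two GPUs):

```python
import torch

# Toy module whose parameters live on cuda:1, not cuda:0.
model = torch.nn.Linear(8, 2).to("cuda:1")

# device_ids defaults to all visible GPUs, so device_ids[0] is cuda:0.
model = torch.nn.DataParallel(model)

x = torch.randn(4, 8, device="cuda:1")
# Raises: RuntimeError: module must have its parameters and buffers
# on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
logits = model(x)
```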

RuojiWang commented 2 years ago

Well, I found the following information:

https://stackoverflow.com/questions/59249563/runtimeerror-module-must-have-its-parameters-and-buffers-on-device-cuda1-devi

Just modify the code in main() of train.py as follows:

```python
# Pick the primary GPU explicitly; device_ids[0] must match `device`,
# since DataParallel keeps the master copy of the model there.
device_ids = [1, 4, 5]
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
logger.info(f"Going to train on {device}")

train_set, valid_set, train_loader, valid_loader = mktrainval(args, logger)

logger.info(f"Loading model from {args.model}.npz")
model = models.KNOWN_MODELS[args.model](head_size=len(valid_set.classes), zero_head=True)
model.load_from(np.load(f"{args.model}.npz"))

logger.info("Moving model onto all GPUs")
model = torch.nn.DataParallel(model, device_ids=device_ids)
```

That solved my problem.
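
For anyone hitting the same thing: an alternative that avoids editing train.py at all is to hide the busy GPUs from the process with the standard CUDA_VISIBLE_DEVICES environment variable (the GPU list below is just an example). The visible GPUs are renumbered from cuda:0 inside the process, so DataParallel's defaults work unchanged:

```
# Physical GPUs 1, 4 and 5 appear inside the process as cuda:0, cuda:1, cuda:2,
# so DataParallel's default device_ids[0] (cuda:0) maps to physical GPU 1.
CUDA_VISIBLE_DEVICES=1,4,5 python3.8 -u train.py \
    --name huaweishengteng --model BiT-M-R50x1 \
    --logdir ./logs --dataset cifar10 --datadir ./cifar
```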