Open RuojiWang opened 2 years ago
Well, I found the following information.
Just modify the code as follows:
device_ids = [1, 4, 5]
device = torch.device("cuda:1")
logger.info(f"Going to train on {device}")

train_set, valid_set, train_loader, valid_loader = mktrainval(args, logger)

logger.info(f"Loading model from {args.model}.npz")
model = models.KNOWN_MODELS[args.model](head_size=len(valid_set.classes), zero_head=True)
model.load_from(np.load(f"{args.model}.npz"))

logger.info("Moving model onto all GPUs")
model = torch.nn.DataParallel(model, device_ids=device_ids)
That solved my problem.
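For anyone hitting the same error, here is a minimal sketch of why the change above works, as far as I understand it. DataParallel expects the module's parameters to be on device_ids[0], so the device the model is moved to should be the first entry of device_ids. The helper names `pick_devices` and `wrap_model` are hypothetical and not part of the BiT code. The fallback to CPU when no requested GPU exists is my own assumption.

```python
def pick_devices(requested_ids, visible_count):
    """Return (primary device string, usable ids) for DataParallel.

    Drops ids that do not exist on this machine and makes the primary
    device the first remaining id, since DataParallel expects the
    parameters to be on device_ids[0].
    """
    usable = [i for i in requested_ids if 0 <= i < visible_count]
    if not usable:
        # Assumption: fall back to CPU when none of the requested GPUs exist.
        return "cpu", []
    return f"cuda:{usable[0]}", usable


def wrap_model(model, requested_ids):
    """Move the model to device_ids[0], then wrap it in DataParallel."""
    import torch  # imported lazily so pick_devices works without PyTorch

    name, ids = pick_devices(requested_ids, torch.cuda.device_count())
    device = torch.device(name)
    model = model.to(device)
    if ids:
        model = torch.nn.DataParallel(model, device_ids=ids)
    return model, device
```

With `wrap_model(model, [1, 4, 5])` on a machine that has those GPUs, the parameters end up on cuda:1, which matches device_ids[0], and cuda:0 is never touched.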
I get the following error:
ssh://root@10.4.208.56:10001/root/anaconda3/bin/python3.8 -u /workspace/project/huafeng/big_transfer-master/train.py --name huaweishengteng --model BiT-M-R50x1 --logdir ./logs --dataset cifar10 --datadir ./cifar
2022-09-14 11:42:29,707 [INFO] bit_common: Namespace(base_lr=0.003, batch=512, batch_split=1, bit_pretrained_dir='.', datadir='./cifar', dataset='cifar10', eval_every=None, examples_per_class=None, examples_per_class_seed=0, logdir='./logs', model='BiT-M-R50x1', name='huaweishengteng', save=True, workers=8)
2022-09-14 11:42:29,707 [INFO] bit_common: Going to train on cuda:1
Files already downloaded and verified
Files already downloaded and verified
2022-09-14 11:42:31,373 [INFO] bit_common: Using a training set with 50000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Using a validation set with 10000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Loading model from BiT-M-R50x1.npz
2022-09-14 11:42:31,893 [INFO] bit_common: Moving model onto all GPUs
2022-09-14 11:42:31,908 [INFO] bit_common: Model will be saved in './logs/huaweishengteng/bit.pth.tar'
2022-09-14 11:42:31,908 [INFO] bit_common: Fine-tuning from BiT
2022-09-14 11:42:34,812 [INFO] bit_common: Starting training!
Traceback (most recent call last):
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 296, in <module>
    main(parser.parse_args())
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 237, in main
    logits = model(x)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Process finished with exit code 1
cuda:0 is used by my lab, so I can only use other GPUs, maybe cuda:7. How can I solve this problem?
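One other option that may help is to set CUDA_VISIBLE_DEVICES so that only the GPUs you are allowed to use are visible. This assumes nothing in the process has initialised CUDA yet. The visible GPUs are renumbered from 0 inside the process, so the default cuda:0 then points at a free card rather than the busy one. The helper name `restrict_visible_gpus` is hypothetical. This is only a sketch.

```python
import os


def restrict_visible_gpus(physical_ids):
    """Expose only the given physical GPUs to this process.

    Has to run before CUDA is initialised; calling it before importing
    torch is the safest place. Returns the logical ids (0..n-1) that the
    process will see for those GPUs.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    return list(range(len(physical_ids)))


# Physical GPU 7 becomes cuda:0 inside this process.
logical_ids = restrict_visible_gpus([7])
```

The same effect can be had from the shell by prefixing the command, e.g. `CUDA_VISIBLE_DEVICES=7 python train.py ...`, in which case the training script can keep using cuda:0 unchanged.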