niuchuangnn / SPICE

SPICE training Error #31

Open kkellyk opened 2 years ago

kkellyk commented 2 years ago

Hi, I have a question. After following the installation steps, I started training with the command from the training tutorial:

python tools/train_moco.py --img_size 32 --moco-k 12800 --arch resnet18_cifar --save_folder ./results/cifar10/moco_res18_cls --resume ./results/cifar10/moco_res18_cls/checkpoint_last.pth.tar --data_type cifar10 --data ./datasets/cifar10 --all 0

Below is the error:

(u) C:\Users\Kelly>cd SPICE

(u) C:\Users\Kelly\SPICE>python tools/train_moco.py --img_size 32 --moco-k 12800 --arch resnet18_cifar --save_folder ./results/cifar10/moco_res18_cls --resume ./results/cifar10/moco_res18_cls/checkpoint_last.pth.tar --data_type cifar10 --data ./datasets/cifar10 --all 0
Use GPU: 0 for training
Traceback (most recent call last):
  File "tools/train_moco.py", line 453, in <module>
    main()
  File "tools/train_moco.py", line 145, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
    while not context.join():
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "C:\Users\Kelly\SPICE\tools\train_moco.py", line 170, in main_worker
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\distributed\distributed_c10d.py", line 602, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\Kelly.conda\envs\u\lib\site-packages\torch\distributed\distributed_c10d.py", line 727, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

How can I solve this? Thanks, Kelly

DOZETS commented 1 year ago

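The Windows platform does not support NCCL. You can change the distributed backend from "nccl" to "gloo" in the config files; the traceback shows it reaches dist.init_process_group through args.dist_backend.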
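
If you would rather not hard-code the backend, something along these lines might work as a rough sketch. It falls back to "gloo" on Windows or whenever the installed PyTorch build reports no NCCL support. Note that pick_backend is a hypothetical helper and not part of SPICE; the only PyTorch calls it relies on are torch.distributed.is_available() and torch.distributed.is_nccl_available().

```python
import sys


def pick_backend(platform=sys.platform, nccl_available=None):
    """Return "nccl" when it is usable, otherwise "gloo".

    Hypothetical helper for illustration; not part of the SPICE repository.
    """
    if nccl_available is None:
        try:
            import torch.distributed as dist

            # is_nccl_available() is only queried when the distributed
            # package itself is present in this PyTorch build.
            nccl_available = dist.is_available() and dist.is_nccl_available()
        except ImportError:
            # PyTorch is not installed, so NCCL cannot be used either.
            nccl_available = False

    # NCCL is not supported on Windows, so always use gloo there.
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


if __name__ == "__main__":
    print(pick_backend())
```

The returned string could then be passed as the backend argument in the dist.init_process_group call at line 170 of tools/train_moco.py, in place of args.dist_backend, assuming the rest of that call stays unchanged.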