Hello,
When I use the command `python tools/plain_train_net.py --config-file configs/train_val_bs16_normal_conv.yaml`, single-GPU training works fine, but when I try to train with multiple GPUs via `python tools/plain_train_net.py --config-file configs/train_val_bs16_normal_conv.yaml --num-gpus 2 --num-machines 1`, the following error occurs:
```
-02 20:15:24,729] smoke.data.datasets.kitti INFO: Initializing KITTI train set with 3712 files loaded
[2023-03-02 20:15:24,775] smoke.trainer INFO: Start training
Traceback (most recent call last):
File "tools/plain_train_net.py", line 107, in
args=(args,),
File "/home/wangguojun//test/SMOKE/smoke/engine/launch.py", line 53, in launch
daemon=False,
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, args)
File "/home/wangguojun//test/SMOKE/smoke/engine/launch.py", line 88, in _distributed_worker
main_func(args)
File "/home/wangguojun//test/SMOKE/tools/plain_train_net.py", line 95, in main
train(cfg, model, device, distributed)
File "/home/wangguojun//test/SMOKE/tools/plain_train_net.py", line 57, in train
tb_log
File "/home/wangguojun//test/SMOKE/smoke/engine/trainer.py", line 73, in do_train
for data, iteration in zip(data_loader, range(start_iter, max_iter)):
TypeError: zip argument #1 must support iteration
(smoke) wangguojun@pc:~//test/SMOKE$ Traceback (most recent call last):
File "", line 1, in
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/home/wangguojun/miniconda3/envs/smoke/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
len(cache))
```