maincold2 / FFNeRV

FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos
MIT License
80 stars 1 fork

Something About Parallel Training #5

Open GraceZhuuu opened 8 months ago

GraceZhuuu commented 8 months ago

Thanks for your great work.

I'm trying to train in parallel on a single machine with two GPUs. The program uses a single GPU by default, so I set `--distributed`, but then this error occurs:

=> No resume checkpoint found at 'output/try_d/ReadySetGo/train/model_latest.pth'
Traceback (most recent call last):
  File "main.py", line 675, in <module>
    main()
  File "main.py", line 137, in main
    mp.spawn(train, nprocs=args.ngpus_per_node, args=(args,))
  File "/home/june/anaconda3/envs/FFNeRV/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/june/anaconda3/envs/FFNeRV/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/june/anaconda3/envs/FFNeRV/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/june/anaconda3/envs/FFNeRV/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/june/projects/NeRVs/FFNeRV/main.py", line 318, in train
    len_agg = len(agg_ind)+1
NameError: name 'agg_ind' is not defined

I checked the code; the flag is parsed here:


    parser.add_argument('--agg_ind', nargs='+', default=[-2,-1,1,2], type=int, help='relative indices of neighboring frames to reference')
    parser.add_argument('--wbit', default=32, type=int, help='QAT weight bit width')

    args = parser.parse_args()

    global agg_ind
    agg_ind = args.agg_ind
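
A likely cause: as the traceback shows, `mp.spawn` uses the `'spawn'` start method, which launches each worker in a fresh Python interpreter, so a module-level global assigned in the parent's `__main__` (like `agg_ind` above) is not visible inside `train()` in the child process. A minimal sketch of one way around it, reading the value from `args` instead of the global (the `Namespace` below is a hypothetical stand-in for the parsed arguments):

```python
from argparse import Namespace

def train(local_rank, args):
    # Read agg_ind from args rather than a module-level global: under the
    # 'spawn' start method each worker is a fresh interpreter, so globals
    # assigned in the parent's __main__ never reach the child process.
    agg_ind = args.agg_ind
    len_agg = len(agg_ind) + 1
    return len_agg

# Stand-in for args = parser.parse_args() in main.py.
args = Namespace(agg_ind=[-2, -1, 1, 2])
print(train(0, args))  # 5
```

Since `args` is already forwarded to every worker via `mp.spawn(train, nprocs=..., args=(args,))`, this avoids relying on process-inherited state entirely.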
GraceZhuuu commented 8 months ago

BTW, I modified the code here because of the error "sampler option is mutually exclusive with shuffle":

@@ -303,7 +303,7 @@ def train(local_rank, args):

     train_dataset = DataSet(train_data_dir, img_transforms,vid_list=args.vid, resol=args.resol, frame_gap=args.frame_gap, )
     train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) if args.distributed else None
-    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batchSize, shuffle=True,
+    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batchSize,
          num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True, worker_init_fn=worker_init_fn)
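
PyTorch's `DataLoader` indeed refuses `shuffle=True` together with a custom `sampler`, so dropping `shuffle` as in the diff above works. One caveat: it also disables shuffling in the single-GPU case, where `train_sampler` is `None`. A common pattern is to tie `shuffle` to whether a sampler is present; a minimal sketch with a hypothetical stand-in dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical tiny dataset standing in for the video frame dataset.
dataset = TensorDataset(torch.arange(8).float())

# Would be DistributedSampler(dataset) when running with --distributed;
# None in the single-GPU case.
sampler = None

# Shuffle only when no sampler is set; with a DistributedSampler, the
# sampler itself shuffles (call sampler.set_epoch(epoch) each epoch).
loader = DataLoader(dataset, batch_size=2,
                    shuffle=(sampler is None), sampler=sampler,
                    drop_last=True)

print(len(list(loader)))  # 4 batches of 2
```

This keeps the original single-GPU shuffling behavior while remaining valid under `DistributedSampler`.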
maincold2 commented 8 months ago

Thank you for your interest and notice! As the baselines used single-batch training, we have not focused on multi-GPU training. Does it perform the same with reduced training time? It would be very useful to implement it!

GraceZhuuu commented 8 months ago

> Thank you for your interest and notice! As the baselines used single-batch training, we have not focused on multi-GPU training. Does it perform the same with reduced training time? It would be very useful to implement it!

I haven't gotten the program to run on multiple GPUs yet, but it works fine on a single GPU. I'm a novice and still trying to solve it. Thanks again for your work.