raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

multi-gpu error #9

Closed: eternaldolphin closed this issue 2 years ago

eternaldolphin commented 2 years ago

Hello, I want to know whether the code can be trained with multiple GPUs.

The given command uses multiple GPUs, e.g. "bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8", but when I run it, it fails with the following errors:

[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in CLIPResNet
[] [] are misaligned params in text encoder
[] [] are misaligned params in text encoder

I also found the note in the code that prints this message (see the attached screenshot).

raoyongming commented 2 years ago

Hi, thanks for your interest in our work.

All of our models are trained with multiple GPUs. The printed information ("A, B are misaligned params in C") checks whether the pre-trained weights are correctly loaded into model C, where A and B are the lists of misaligned parameters. Since both A and B are empty lists in your output, the weights have been loaded successfully.
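For illustration, here is a minimal sketch of how such a check can work (this is not the actual DenseCLIP loading code; the function and variable names are made up): compare the checkpoint's state dict against the model's and print the keys that do not match.

```python
import torch

def load_pretrained(model, checkpoint_path, name):
    """Load a checkpoint and report parameters that do not line up with `model`."""
    state_dict = torch.load(checkpoint_path, map_location='cpu')
    model_state = model.state_dict()
    # Keys in the checkpoint that are absent from the model or have a different shape.
    unexpected = [k for k, v in state_dict.items()
                  if k not in model_state or v.shape != model_state[k].shape]
    # Keys the model expects but the checkpoint does not provide.
    missing = [k for k in model_state if k not in state_dict]
    # Both lists print as [] when the weights align, which is what the log above shows.
    print(unexpected, missing, 'are misaligned params in', name)
    model.load_state_dict(state_dict, strict=False)
```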

Also, we don't use the --gpus argument; the number of GPUs is set through the torch.distributed.launch tool (see this file).
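As a rough sketch of this launcher-based setup (illustrative only, not the repository's train.py): torch.distributed.launch spawns one process per GPU, passes --local_rank to each process, and exports the rendezvous environment variables, so the training script only needs to pick its device and join the process group.

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# Each spawned process binds to its own GPU and joins the process group;
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE come from the launcher's environment.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl')
print(f'rank {dist.get_rank()} of {dist.get_world_size()} is up')
```

With this style of launch, the number of GPUs is controlled by the launcher's --nproc_per_node argument rather than a --gpus flag on the script itself.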

Could you provide more outputs of the command? The provided message may not be the reason for the training failure.

eternaldolphin commented 2 years ago

Thank you for your nice reply.

1. It seems that the training fails when I assign specific GPUs. I found a similar issue here. Although I'm not sure of the exact reason, it does not block my training at present. Thank you. More of the error output is below:

Traceback (most recent call last):
  File "/home/rd/anaconda3/envs/denseclip/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/rd/anaconda3/envs/denseclip/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rd/anaconda3/envs/denseclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/rd/anaconda3/envs/denseclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/rd/anaconda3/envs/denseclip/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/rd/anaconda3/envs/denseclip/bin/python', '-u', './train.py', '--local_rank=3', 'configs/retinanet_denseclip_r101_fpn_1x_coco.py', '--launcher', 'pytorch']' died with <Signals.SIGSEGV: 11>.

2. By the way, why is the context_length set to 13? I notice the comment that the positional_embedding is truncated from 77 to 13 (see the attached screenshot).

raoyongming commented 2 years ago

The text encoder pads the input sequence to context_length for parallel computing. We truncate the context sequence because its length is always less than 13 in our case (e.g., the sentence "a photo of <class name>" has fewer than 13 tokens for any <class name> in COCO and ADE). Setting the length to 13 reduces computation and memory consumption during training and inference.
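As a concrete illustration (a minimal sketch assuming the OpenAI clip package; the variable names are mine, not DenseCLIP's), the prompts can be tokenized with the reduced context length and the text positional embedding sliced to match:

```python
import clip
import torch

context_length = 13
model, _ = clip.load("RN50", device="cpu")

# clip.tokenize pads (or errors on overflow) to the given context length;
# "a photo of traffic light" uses well under 13 tokens including start/end tokens.
tokens = clip.tokenize(["a photo of traffic light"], context_length=context_length)
print(tokens.shape)  # torch.Size([1, 13])

# Keep only the first 13 rows of the text positional embedding (77 -> 13).
pos_embed = model.positional_embedding[:context_length]
print(pos_embed.shape)  # e.g. torch.Size([13, 512]) for RN50's text encoder
```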