muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License
619 stars · 43 forks

How to train MaPLe on multi-GPU? #15

Closed SHIBOYA closed 1 year ago

SHIBOYA commented 1 year ago

I have 4 GPUs; how can I train MaPLe on multiple GPUs?

SHIBOYA commented 1 year ago

When I run MaPLe on my 4 GPUs, it raises this error:

```
Traceback (most recent call last):
  File "train.py", line 237, in <module>
    main(args)
  File "train.py", line 179, in main
    trainer.train()
  File "/home/shiboya/Dassl.pytorch/dassl/engine/trainer.py", line 386, in train
    super().train(self.start_epoch, self.max_epoch)
  File "/home/shiboya/Dassl.pytorch/dassl/engine/trainer.py", line 250, in train
    self.run_epoch()
  File "/home/shiboya/Dassl.pytorch/dassl/engine/trainer.py", line 596, in run_epoch
    loss_summary = self.forward_backward(batch)
  File "/home/shiboya/maple/trainers/maple.py", line 297, in forward_backward
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 84, in parallel_apply
    output = results[i]
KeyError: 0
```

I can't figure out how to solve this problem.
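From what I can tell, Dassl wraps the model in nn.DataParallel whenever it sees more than one visible GPU, which is what routes the forward pass through parallel_apply above. Roughly (paraphrasing dassl/engine/trainer.py from memory, so the exact lines may differ):

```python
import torch
import torch.nn as nn

def maybe_wrap(model: nn.Module) -> nn.Module:
    # Sketch of what Dassl's trainer appears to do at build time (paraphrased,
    # may not match the actual source line-for-line): with 4 visible GPUs the
    # model is silently wrapped in nn.DataParallel.
    device_count = torch.cuda.device_count()
    if device_count > 1:
        print(f"Detected {device_count} GPUs (use nn.DataParallel)")
        model = nn.DataParallel(model)
    return model
```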

SHIBOYA commented 1 year ago

I guess it can hardly be trained on multiple GPUs; now there is a new error:

```
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shiboya/maple/trainers/maple.py", line 193, in forward
    prompts, shared_ctx, deep_compound_prompts_text, deep_compound_prompts_vision = self.prompt_learner()
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shiboya/maple/trainers/maple.py", line 173, in forward
    visual_deep_prompts.append(layer(self.compound_prompts_text[index]))
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/modules/container.py", line 454, in __getitem__
    idx = self._get_abs_string_index(idx)
  File "/home/shiboya/.conda/envs/maple/lib/python3.8/site-packages/torch/nn/modules/container.py", line 437, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range
```
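If I read the trace right, the index that goes out of range is into compound_prompts_text, so my guess (unverified) is the known incompatibility between nn.ParameterList and nn.DataParallel: the per-GPU replicas end up seeing the list as empty. A minimal sketch of that failure mode, where the class name and shapes are invented for illustration:

```python
import torch
import torch.nn as nn

class PromptLearnerLike(nn.Module):
    """Toy stand-in for MaPLe's prompt learner (names/shapes are made up)."""

    def __init__(self):
        super().__init__()
        # Deep prompts stored in a ParameterList, as I believe trainers/maple.py does.
        self.compound_prompts_text = nn.ParameterList(
            nn.Parameter(torch.randn(2, 512)) for _ in range(3)
        )

    def forward(self, x):
        # Inside a DataParallel replica the ParameterList can appear empty,
        # so indexing it raises "IndexError: index 0 is out of range".
        return x + self.compound_prompts_text[0].sum()

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(PromptLearnerLike().cuda())
    model(torch.randn(8, 512).cuda())  # reproduces the IndexError above
```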

muzairkhattak commented 1 year ago

Hi @SHIBOYA,

Thank you for reaching out.

Our MaPLe repository is configured for single-GPU use only. Note that the full set of experiments across all 11 datasets takes around 10 hours on a single GPU.

Moreover, the Dassl library by default uses the DataParallel package, which is inherently slow even if we configure the code for multi-GPU training. So it is highly recommended to use a single GPU for running MaPLe experiments. Please refer to this issue as well.
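If your machine has several GPUs, one way to keep the run on a single card (a suggestion, not an official script) is to restrict the visible devices before anything touches CUDA, e.g. `CUDA_VISIBLE_DEVICES=0 python train.py ...`, or equivalently from Python:

```python
import os

# Must run before torch (or dassl) is imported; once CUDA has enumerated the
# devices, Dassl will see all 4 GPUs and wrap the model in DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # noqa: E402  (deliberately imported after setting the env var)

assert torch.cuda.device_count() == 1
```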

Kind regards.

SHIBOYA commented 1 year ago

Thank you for your reply!