Your code has been very helpful in getting me started.
A few questions came up while I was using the code. I tried to contact you via email and Twitter, but that didn't seem to work, so I'm writing here.
When I change [num_processes] in the ds_zero3.yaml file from the existing 1 to 2, running on a machine with [A6000 x 2], the following error occurs.
It occurs after all of the training steps have finished.
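'''
accelerator.gather_for_metrics((prediction_scores, batch["labels"])) : [[0], tensor([0], device='cuda:0'), [0], tensor([0], device='cuda:1')]
prediction_scores : [0]
batch["labels"] : tensor([0], device='cuda:0')
accelerator.gather_for_metrics((prediction_scores, batch["labels"])) : [[0], tensor([0], device='cuda:0'), [0], tensor([0], device='cuda:1')]
prediction_scores : [0]
batch["labels"] : tensor([0], device='cuda:1')
'''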
'''
peft_lora_embedding_semantic_search.py", line 551, in main
prediction_scores, references = accelerator.gather_for_metrics((prediction_scores, batch["labels"]))
ValueError: too many values to unpack (expected 2)
'''
My guess is that the values computed on the 2 GPUs are not being merged into a single (predictions, labels) pair. Instead, each GPU's pair is appended to one flat list, so the assignment that expects 2 values receives 4 and raises the error.
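To illustrate this guess, here is a minimal pure-Python sketch that reproduces the shape of the output without torch or GPUs. I am assuming that, because prediction_scores is a plain Python list rather than a tensor, the tuple gets gathered as generic objects and flattened across processes. I have not confirmed this in the Accelerate source. Both helper functions are hypothetical stand-ins I wrote for illustration, not Accelerate APIs, and plain lists stand in for tensors.

```python
# Pure-Python stand-in for what each process passes to gather_for_metrics.
# Plain lists replace tensors so this runs without torch or multiple GPUs.


def simulate_object_gather(per_process_inputs):
    """Hypothetical helper: append every process's tuple items to one flat list.

    This mimics the 4-item result printed in the debug output above.
    """
    return [item for inputs in per_process_inputs for item in inputs]


def simulate_tensor_gather(per_process_inputs):
    """Hypothetical helper: gather each tuple position separately.

    Values at the same position are concatenated across processes,
    so the (predictions, labels) structure is preserved.
    """
    num_positions = len(per_process_inputs[0])
    return tuple(
        [value for inputs in per_process_inputs for value in inputs[pos]]
        for pos in range(num_positions)
    )


rank0 = ([0], [0])  # (prediction_scores, labels) on cuda:0
rank1 = ([0], [0])  # (prediction_scores, labels) on cuda:1

# Flattened result: four items, which cannot be unpacked into two names.
flat = simulate_object_gather([rank0, rank1])
print(len(flat))

# Position-wise result: two items, one per tuple position.
prediction_scores, references = simulate_tensor_gather([rank0, rank1])
print(prediction_scores, references)
```

If that assumption is right, converting prediction_scores to a tensor on the same device as batch["labels"] before gathering might keep the two-item structure, but I have not verified this.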
I need help to solve this part :(
P.S. Is there any other way to contact you (email, for example)?
Is there any way to use multiple GPUs?