Closed eric8607242 closed 1 year ago
Hey, how did you write your tensor_parallelize function if you followed our gpt2 example?
There is a tensor_parallelize function in the GPT example; when would people need to implement their own tensor_parallelize? @JThh
@JThh Hi, thanks for your response!
I follow the tensor_parallelize function in the example because I also use the same GPT2 model (Hugging Face version).
Were you able to successfully update the parameters and decrease the loss with the example code?
Hi @eric8607242, I guess the reason is: if a tensor is all-gathered in the forward pass, its gradient should be reduce-scattered in the backward pass rather than simply sliced.
Hi @kurisusnowdeng, Thanks for your response. I will try to address the issue in this direction!
🐛 Describe the bug
Hello there,
Thanks for this awesome project.
I am currently training a GPT2 model for contrastive learning with the InfoNCE loss using tensor parallelism. To implement the training codebase, I followed the GPT2_Gemini example.
However, I encountered an issue when using tensor parallelism with a degree of 2: the parameters were not updating. After switching to a degree of 1 (data parallelism only), the parameters updated successfully and the loss decreased significantly.
Can anyone help point out how to fix this issue? Big thanks!
I calculate the InfoNCE loss with the following code:
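The attached snippet isn't reproduced here, but for context, a minimal single-process version of an InfoNCE computation (the function name and signature below are illustrative, not the issue's actual code) could look like:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, temperature=0.07):
    """Minimal InfoNCE: row i of `queries` matches row i of `keys`."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    # (N, N) cosine-similarity matrix, scaled by the temperature.
    logits = q @ k.t() / temperature
    # Positives sit on the diagonal, so the target for row i is class i.
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Under tensor/data parallelism, `queries` and `keys` are typically all-gathered across ranks first so that every process sees the full set of negatives.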
To gather the data from each process and calculate the InfoNCE loss, I apply this GatherLayer.
Environment