hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: Parameters are not updated under tensor parallelism #3494

Closed: eric8607242 closed this issue 1 year ago

eric8607242 commented 1 year ago

πŸ› Describe the bug

Hello there,

Thanks for this awesome project.

I am currently training a GPT-2 model with a contrastive-learning (InfoNCE) loss using tensor parallelism. To implement the training codebase, I followed the GPT2_Gemini example.

However, I encountered an issue when using tensor parallelism with a degree of 2: the parameters were not updated. When I switched to a degree of 1 with data parallelism only, the parameters updated successfully and the loss decreased significantly.

Can anyone point out how to fix this issue? Many thanks!

I calculate the InfoNCE loss with the following code:

import torch
import torch.nn.functional as F

# Defined as a method of my training module; self.temperature is the InfoNCE temperature scale.
def calculate_in_batch_contrastive_loss(self, x):
    # Gather the embeddings from every process so negatives come from the whole global batch.
    x = torch.cat(GatherLayer.apply(x), dim=0)
    query_x = x.unsqueeze(0)
    key_x = x.unsqueeze(1)
    # Pairwise cosine-similarity matrix, scaled by the temperature.
    cos_similarity = F.cosine_similarity(query_x, key_x, -1) * self.temperature
    labels = torch.arange(cos_similarity.size(0)).cuda()

    loss = F.cross_entropy(cos_similarity, labels)
    return loss
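
For context, a minimal hypothetical call site (the encode helper and variable names are illustrative, not actual code from my project) looks like:

embeddings = self.encode(batch)  # hypothetical helper returning a (batch_size, hidden_dim) tensor
loss = self.calculate_in_batch_contrastive_loss(embeddings)
loss.backward()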

To gather the embeddings from each process before computing the InfoNCE loss, I use this GatherLayer:

import torch
import torch.distributed as dist

class GatherLayer(torch.autograd.Function):
    """All-gather a tensor from every process while keeping it in the autograd graph."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # One buffer per rank to hold the gathered copies.
        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]
        dist.all_gather(output, input)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        input, = ctx.saved_tensors
        # Each rank keeps only the gradient slice corresponding to its own input.
        grad_out = torch.zeros_like(input)
        grad_out[:] = grads[dist.get_rank()]
        return grad_out

Environment

[GPU]
RTX 3090
RTX 4090

[CUDA]
CUDA == 11.6

[Python package]
colossalai == 0.2.7
torch == 1.13.1

JThh commented 1 year ago

Hey, how did you write your tensor_parallelize function if you followed our gpt2 example?

EarthXP commented 1 year ago

Hey, how did you write your tensor_parallelize function if you followed our gpt2 example?

There is a tensor_parallelize function in the GPT example; in which cases do people need to implement their own tensor_parallelize? @JThh

eric8607242 commented 1 year ago

@JThh Hi, thanks for your response!

I followed the tensor_parallelize function in the example because I also use the same GPT-2 model (Hugging Face version). Were you able to successfully update the parameters and decrease the loss with the example code?

kurisusnowdeng commented 1 year ago

Hi @eric8607242, I guess the reason is that if a tensor is all-gathered in the forward pass, its gradient should be reduce-scattered rather than simply sliced.
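
For reference, here is a minimal sketch of a gather op along those lines (the class name is hypothetical; it returns the concatenated tensor directly instead of a tuple and assumes equal batch sizes per rank and the default process group):

import torch
import torch.distributed as dist

class GatherWithReduceScatterGrad(torch.autograd.Function):
    """All-gather the input in the forward pass; reduce-scatter the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, input):
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(input) for _ in range(world_size)]
        dist.all_gather(gathered, input)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        world_size = dist.get_world_size()
        # Split the gradient of the gathered tensor back into one chunk per rank.
        grad_chunks = list(grad_output.contiguous().chunk(world_size, dim=0))
        # Sum each chunk across ranks; every rank receives the summed gradient for its own shard.
        grad_input = torch.empty_like(grad_chunks[dist.get_rank()])
        dist.reduce_scatter(grad_input, grad_chunks)
        return grad_input

Compared with the GatherLayer above, this backward pass sums the gradient contributions from all ranks for each shard instead of keeping only the local slice.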

eric8607242 commented 1 year ago

Hi @kurisusnowdeng, thanks for your response. I will try to address the issue in that direction!