Open unwritten opened 2 years ago
hmm, are you sure you aren't OOM?
The code segment below reports the error in the title under multi-GPU training:
# rotary embeddings
positions = self.get_rotary_embedding(n, device)
q, k = map(lambda t: apply_rotary_pos_emb(positions, t), (q, k))
Are you using a specific library for parallel computing? Horovod, PyTorch Lightning, Fairscale, Deepspeed, or PyTorch distributed with model = nn.DataParallel(model)? I have tested parallel GPU use with both Deepspeed and model = nn.DataParallel(model) so far. cuDNN errors can be quite difficult to debug. Have you tried on CPU or using .detach()?
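To make the "try it on CPU" suggestion concrete, here is a rough sketch of the rotary step run on CPU in isolation. The helpers rotary_positions, rotate_half and this apply_rotary_pos_emb are stand-ins I wrote using the common rotary formulation; they are assumptions, not necessarily what the repository does, and the tensor shapes are made up.

```python
import torch


def rotate_half(x):
    # Split the last dimension into two halves and map (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(pos, t):
    # Common rotary formulation; a stand-in, not necessarily the repo's version.
    return t * pos.cos() + rotate_half(t) * pos.sin()


def rotary_positions(n, dim, device):
    # Hypothetical stand-in for self.get_rotary_embedding(n, device).
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).float() / dim))
    seq = torch.arange(n, device=device).float()
    freqs = torch.outer(seq, inv_freq)        # shape (n, dim / 2)
    return torch.cat((freqs, freqs), dim=-1)  # shape (n, dim)


# Run on CPU first: errors are raised synchronously, at the line that caused them.
device = torch.device("cpu")
n, dim = 8, 16  # made-up sequence length and head dimension
q = torch.randn(2, 4, n, dim, device=device)
k = torch.randn(2, 4, n, dim, device=device)

positions = rotary_positions(n, dim, device)
q_rot, k_rot = map(lambda t: apply_rotary_pos_emb(positions, t), (q, k))
```

If this runs cleanly on CPU with your real shapes, the problem is more likely in the multi-GPU setup (for example, tensors ending up on different devices) than in the rotary math itself. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 for the GPU run makes kernel launches synchronous, which usually gives a more accurate stack trace.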