Closed rao1219 closed 1 year ago
Hi,
Thanks for your work and code.
I have encountered a strange phenomenon when using the code: the training gets stuck at a fixed epoch, printing ``loss is nan".
I believe the training gets stuck because of CUDA out of memory.
Your training script (L32-37 in seq_scripts.py) skips the optimization step when it encounters a nan loss; however, if you also skip the loss backpropagation, the autograd graph from that forward pass is never freed, so the next iteration effectively holds two batches' worth of activations in memory and eventually triggers CUDA out of memory.
My current solution is to move the backward call
scaler.scale(loss).backward()
before the `if` condition at L31. (I'm not sure it solves the problem, since my experiment has only just restarted.) So I think this might be a bug — or is there some other reason that could trigger my issue?
Update: after moving the backward operation ahead of the nan check, the training runs normally.
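For reference, here is a minimal sketch of the reordered training step described above. It is a generic PyTorch AMP loop, not the actual code from seq_scripts.py (the names `model`, `optimizer`, and `scaler` are assumptions); the scaler is created with `enabled=False` so the sketch also runs on CPU:

```python
import torch

# Hypothetical tiny model/optimizer just to make the sketch runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # enabled=False: no CUDA needed

def train_step(x, y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Backward FIRST, so the autograd graph (and its saved activations) is
    # freed even when the loss is non-finite; otherwise the graph lingers
    # and the next forward pass holds two batches' worth of memory.
    scaler.scale(loss).backward()
    if torch.isnan(loss) or torch.isinf(loss):
        print('loss is nan')
        optimizer.zero_grad()  # discard the nan gradients, skip the update
        return loss
    scaler.step(optimizer)
    scaler.update()
    return loss
```

With this ordering, a nan loss only skips the parameter update; the backward pass still runs, so no graph accumulates across iterations.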
Many thanks for your response. I haven't encountered this issue myself. The output 'loss is nan' does appear occasionally, but it doesn't cause my code to exit. In fact, if a loss is nan, it shouldn't be backpropagated, so I simply skip it. When it is skipped, the activations saved for the backward pass remain in memory, which may cause 'out of memory'. I wonder if your GPU memory is relatively small, e.g. 11 GB. Overall, your solution should help you overcome this issue.
My batch size is larger, and I occupy about 80% of GPU memory during training, so I suspect this issue arises when a large proportion of GPU memory is already in use.