Closed rao1219 closed 1 year ago
Hi,
Thanks for your work and code.
I have encountered a strange phenomenon when using the code: the training gets stuck at a fixed epoch, printing ``loss is nan".
I believe the training gets stuck because of CUDA out of memory.
Your training script (L32-37 in seq_scripts.py) skips the optimization step when it encounters a nan loss; however, if you also skip the loss backpropagation, the autograd graph from that forward pass is never freed, so the next iteration effectively holds two batches' worth of activations in memory and eventually triggers CUDA out of memory.
My current solution is to move the backward call
scaler.scale(loss).backward()
before the `if` condition at L31. (I'm not sure it solves the problem, since my experiment has only just restarted.) So I think this might be a bug — or is there some other reason that could trigger my issue?
Update: after moving the backward operation ahead of the nan check, the training runs normally.
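For reference, here is a minimal sketch of the reordered training step described above. It is a generic PyTorch AMP loop, not the actual code from seq_scripts.py (the names `model`, `optimizer`, and `scaler` are assumptions); the scaler is created with `enabled=False` so the sketch also runs on CPU:

```python
import torch

# Hypothetical tiny model/optimizer just to make the sketch runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # enabled=False: no CUDA needed

def train_step(x, y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Backward FIRST, so the autograd graph (and its saved activations) is
    # freed even when the loss is non-finite; otherwise the graph lingers
    # and the next forward pass holds two batches' worth of memory.
    scaler.scale(loss).backward()
    if torch.isnan(loss) or torch.isinf(loss):
        print('loss is nan')
        optimizer.zero_grad()  # discard the nan gradients, skip the update
        return loss
    scaler.step(optimizer)
    scaler.update()
    return loss
```

With this ordering, a nan loss only skips the parameter update; the backward pass still runs, so no graph accumulates across iterations.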
Many thanks for your response. I haven't encountered this issue myself. The output 'loss is nan' does appear occasionally, but it doesn't cause my code to exit. In fact, if a loss is nan, it shouldn't be backpropagated, so I simply skip it. When it is skipped, the activations saved for the backward pass remain in memory, which may cause 'out of memory'. I wonder if your GPU memory is relatively small, e.g. 11 GB. Overall, your solution should help you overcome this issue.
My batch size is larger, and I occupy about 80% of GPU memory during training, so I suspect this issue arises when a large proportion of GPU memory is already in use.