lhwcv / mlsd_pytorch

PyTorch implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"
Apache License 2.0

Loss becomes NaN after a certain epoch. #9

Open · ynsa opened this issue 2 years ago

ynsa commented 2 years ago

Hi! I am trying to train the tiny model on the wireframe dataset (to reproduce the loss issue faster, make the dataset smaller). After a certain step the loss becomes NaN. Debugging shows that some masks in weighted_bce_with_logits are fully zero, and dividing by torch.sum(mask) returns NaN.
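A minimal sketch of the failure mode and one possible guard; the real weighted_bce_with_logits in this repo may have a different signature, this is just an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def weighted_bce_with_logits_sketch(logits, target, mask, eps=1e-6):
    # Per-pixel BCE, masked and normalized by the number of valid pixels.
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    loss = loss * mask
    # When mask is all zeros, torch.sum(mask) == 0 and 0 / 0 gives NaN;
    # clamping the denominator keeps the loss finite (zero) for empty masks.
    return torch.sum(loss) / torch.clamp(torch.sum(mask), min=eps)

logits = torch.randn(1, 1, 256, 256)
target = torch.zeros_like(logits)
mask = torch.zeros_like(logits)  # a fully-zero mask, as seen in the bad batches
print(weighted_bce_with_logits_sketch(logits, target, mask))  # tensor(0.) instead of nan
```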

How to fix?

Will appreciate any help.

vipul2296 commented 2 years ago

Getting the same issue: after a few epochs the loss is NaN. Did anyone find a solution?

harolddu commented 2 years ago

Stuck on the same problem, did you find any solution?

vipul2296 commented 2 years ago

No, I haven't found a solution yet. If we could get together and solve the problem, it would be great for both of us.


michelebechini commented 2 years ago

Did anyone solve this bug?

michelebechini commented 2 years ago

@lhwcv do you have any possible solution for the NaN problem reported here?

lhwcv commented 2 years ago

@lhwcv do you have any possible solution for the NaN problem reported here?

@michelebechini Hi, have you tried reducing the learning rate? If it still goes to NaN, please report your software info to me (Python version, PyTorch version, ...). I'll see if I have time this month to reproduce and fix it.

michelebechini commented 2 years ago

@michelebechini Hi, have you tried reducing the learning rate? If it still goes to NaN, please report your software info to me (Python version, PyTorch version, ...). I'll see if I have time this month to reproduce and fix it.

I tried the tiny model with a learning rate of 0.0005 and it still returns NaN. I used Python 3.8.12 and PyTorch 1.10.

michelebechini commented 2 years ago

The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss.

It is not related to the learning rate: I tried a training run that excludes the line segmentation loss from the loop, and it seems to work fine.
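A rough sketch of that workaround, assuming the loss terms are simply summed in the training loop; the dictionary keys below are illustrative, not the repo's actual names:

```python
import torch

def combined_loss(loss_terms, seg_gt_mask):
    # Hypothetical guard: add up the other loss terms as usual, but skip the
    # line-segmentation term whenever its ground-truth mask is empty, since
    # that is exactly the case that divides by zero and produces NaN.
    total = sum(v for k, v in loss_terms.items() if k != 'line_seg')
    if torch.sum(seg_gt_mask) > 0:
        total = total + loss_terms['line_seg']
    return total
```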

ssvicnent commented 2 years ago

same question

fanjing8 commented 2 years ago

I have the same question on the wireframe and custom datasets, even though I have already used the pretrained model and reduced the learning rate.

fanjing8 commented 2 years ago

The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss.

It is not related to the learning rate: I tried a training run that excludes the line segmentation loss from the loop, and it seems to work fine.

This may be a solution.

huyhoang17 commented 2 years ago

hi @michelebechini

The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss. It is not related to the learning rate: I tried a training run that excludes the line segmentation loss from the loop, and it seems to work fine.

Did you solve this issue? I trained the model on the Wireframe dataset, but the results were not good; sAP_10 was stuck at ~30 (6x in the paper). How can I improve the training process? Thanks

huyhoang17 commented 2 years ago

@VEDFU Did you reproduce the results reported in the paper on the Wireframe dataset?

michelebechini commented 2 years ago

hi @michelebechini

The error is in the ground-truth mask of the line segmentation, which becomes all zeros after a few epochs when it is used to compute the line segmentation loss. It is not related to the learning rate: I tried a training run that excludes the line segmentation loss from the loop, and it seems to work fine.

Did you solve this issue? I trained the model on the Wireframe dataset, but the results were not good; sAP_10 was stuck at ~30 (6x in the paper). How can I improve the training process? Thanks

I didn't solve the issue, because for me too the results of this PyTorch implementation are not as good as in the original paper. The issue is NOT related to the learning rate; it is simply an issue in how the computed ground-truth masks are read. Moreover, note that the last time I tried, the loss functions also had some bugs (no matching loss), and this can strongly affect the final sAP_10.

kushnir95 commented 2 years ago

It looks like I found one of the reasons for this issue. Consider the piece of code in mlsd_pytorch/data/wireframe_dset.py around rows 334-335. Provided that the input size is 512x512, junction_map and line_map are (256, 256, 1) NumPy arrays, so junction_map[0] and line_map[0] have shape (256, 1). Since NumPy broadcasts whenever it is necessary and possible, the code in rows 334-335 executes without errors, but the maps written to label[14, ...] and label[15, ...] are incorrect. One possible solution is to change junction_map[0] to junction_map[:, :, 0] and line_map[0] to line_map[:, :, 0].
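A small standalone illustration of the broadcasting issue and the proposed fix; the shapes follow the comment above, while the 16-channel label array is just for the example, not the repo's exact layout:

```python
import numpy as np

H = W = 256
junction_map = np.random.rand(H, W, 1).astype(np.float32)
line_map = np.random.rand(H, W, 1).astype(np.float32)
label = np.zeros((16, H, W), dtype=np.float32)  # channel count illustrative only

# Buggy indexing: junction_map[0] is the first row, shape (256, 1).
# NumPy silently broadcasts it over the (256, 256) target, so no error is
# raised but the stored ground-truth map is wrong.
label[14, ...] = junction_map[0]
label[15, ...] = line_map[0]

# Proposed fix: take the single channel explicitly, giving shape (256, 256).
label[14, ...] = junction_map[:, :, 0]
label[15, ...] = line_map[:, :, 0]
```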

huyhoang17 commented 2 years ago

Hi @kushnir95, thank you so much. It works. The sAP_10 score increased from 30 to 56 with the mlsd_large model.