Loss is always "inf" when running training (#1)
Closed: nmd95 closed 3 years ago
Hi @nmd95, thank you for bringing this to my attention. There may be a bug in the loss function in this version of the code. I will take a look and get back to you. In the meantime, one thing you can try is reducing the batch size for training.
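For reference, if train.py builds a standard PyTorch DataLoader, the workaround would look something like the sketch below; the dataset construction and the original batch size are assumptions, not taken from the repository.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 'dataset' is a stand-in for whatever Dataset train.py actually constructs.
dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))

# Workaround: drop batch_size down to 1 (the default used by train.py is
# not shown in this thread and is an assumption here).
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)
```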
This seems to be resolved only by reducing the batch size to 1.
Ok. I will take a look and get back to you on this problem.
Caused by log(0). Fixed by adding an epsilon to the log in the unsupervised loss function:

loss = (loss_offset * pred_scores) - (weight * torch.log(pred_scores + eps))
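To illustrate why this fixes the "inf" loss, here is a minimal sketch: when pred_scores contains an exact 0 (e.g. a saturated softmax output), torch.log returns -inf and the whole loss becomes inf, while the epsilon keeps it finite. The weight and eps values here are illustrative, not taken from the repository.

```python
import torch

def unsupervised_loss(loss_offset, pred_scores, weight=0.1, eps=1e-6):
    # torch.log(0) = -inf, which propagates to an inf total loss.
    # Adding eps inside the log keeps the term finite.
    return ((loss_offset * pred_scores)
            - (weight * torch.log(pred_scores + eps))).mean()

# A zero score blows up the un-patched loss; the patched one stays finite.
scores = torch.tensor([0.0, 0.3, 0.7])
offsets = torch.ones(3)
print(unsupervised_loss(offsets, scores))                   # finite
print((offsets * scores - 0.1 * torch.log(scores)).mean())  # inf
```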
Hi, I tried to run train.py after cloning the repo and downloading the dataset. As can be seen in the attached screenshot, the loss remains "inf" even after approximately 200 epochs of training.
Is there something that can be done about this? What is going on here?
Any help would be much appreciated!
Cheers,