model.eval causing nan values

microsoft / Swin3D

A shift-window based transformer for 3D sparse tasks

model.eval causing nan values #17

Open hpc100 opened 1 year ago

hpc100 commented 1 year ago

Thanks for sharing your work! @Yukichiii @yuxiaoguo I tried to test your code on Semantic3D. In the validation step, I get NaN values in the output.

- I checked the point cloud data, and there are no NaN values in the npy files.
- I checked the weights with: print('Nan value .... ???? ', [k for k, v in model.named_parameters() if any(torch.isnan(v.ravel()))]). No parameter contains NaN values, either in training (model.train()) or in validation (model.eval()).
- torch.where(torch.isnan(coord) == True), torch.where(torch.isnan(feat) == True) and torch.where(torch.isnan(batch) == True) all return empty results.
- So there are no NaN values in the data or in the weights/biases, yet the output is filled with NaN.

Do you have any idea where the problem could come from (layer norm, ...)?
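One gap in the checks above: model.named_parameters() does not include buffers, and layers such as BatchNorm keep their running statistics as buffers that are only used in eval mode. Whether that applies to this model is an assumption, not something the thread confirms. The sketch below is a generic PyTorch debugging aid, not code from the Swin3D repository: it scans parameters and buffers for non-finite values and uses forward hooks to record which module first produces a non-finite output. The helper names and the toy model are hypothetical.

```python
import torch
import torch.nn as nn


def find_nonfinite_tensors(model: nn.Module) -> list:
    """Names of floating-point parameters AND buffers containing NaN or Inf."""
    named = list(model.named_parameters()) + list(model.named_buffers())
    return [
        name for name, t in named
        if t.is_floating_point() and not torch.isfinite(t).all()
    ]


def register_nonfinite_hooks(model: nn.Module, report: list) -> list:
    """Record, in call order, the names of modules whose output is non-finite."""
    def make_hook(name):
        def hook(module, inputs, output):
            # Only plain tensor outputs are checked in this sketch.
            if torch.is_tensor(output) and output.is_floating_point():
                if not torch.isfinite(output).all():
                    report.append(name)
        return hook

    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]


# Toy stand-in for the real network (hypothetical, not Swin3D).
toy = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
with torch.no_grad():
    toy[1].running_var[0] = float("nan")  # simulate a corrupted running statistic

bad_tensors = find_nonfinite_tensors(toy)

report = []
handles = register_nonfinite_hooks(toy, report)
toy.eval()
with torch.no_grad():
    toy(torch.randn(2, 4))
for h in handles:
    h.remove()  # always detach hooks after debugging

print(bad_tensors)
print(report)
```

The first name appended to report is the earliest module whose output went non-finite, which narrows the search to that layer and its inputs. Modules that return tuples or custom sparse containers would need an extra unpacking step in the hook.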

jaswanthbjk commented 1 year ago

Found any solution?

hpc100 commented 1 year ago

No. @jaswanthbjk, do you have the same problem?

jaswanthbjk commented 1 year ago

Yes, I had the same problem.

But not anymore when I included normals in the features.

hpc100 commented 1 year ago

@jaswanthbjk So you no longer get NaN values with XYZ + RGB + Normals, or is it with XYZ + Normals? I tried both and still get NaN values when the model is set to eval mode. Which point clouds do you use for validation? (I use domfountain_station1_xyz_intensity and untermaederbrunnen_station3_xyz_intensity.) Have you tried intensity features?

jaswanthbjk commented 1 year ago

@hpc100

Sorry for the confusing reply.

I am still getting NaNs in eval mode, but not during training with RGB + Normals + XYZ, which is super weird to me.

hpc100 commented 1 year ago

Found any solution? @jaswanthbjk Have you tried evaluating the model on CPU or on another GPU?

jaswanthbjk commented 1 year ago

No. However I run it, the result is NaN values.

The dataloader is very different between train and val. Maybe digging around that might help solve the issue.
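A simple way to start digging is to print the same summary for one training batch and one validation batch and compare value ranges, dtypes and shapes. The sketch below assumes each batch is a dict of tensors keyed by coord, feat and batch (the names used earlier in this thread); the actual structure produced by the Swin3D loaders may differ, and summarize_batch is a hypothetical helper.

```python
import torch


def summarize_batch(batch: dict) -> dict:
    """Per-key dtype, shape, value range and finiteness for tensor entries."""
    summary = {}
    for key, value in batch.items():
        if not torch.is_tensor(value):
            continue  # skip non-tensor metadata
        v = value.detach().float()
        summary[key] = {
            "dtype": str(value.dtype),
            "shape": tuple(value.shape),
            "min": v.min().item() if v.numel() else None,
            "max": v.max().item() if v.numel() else None,
            "all_finite": bool(torch.isfinite(v).all()),
        }
    return summary


# Synthetic stand-ins for one train batch and one val batch (assumed layout).
train_like = {
    "coord": torch.rand(100, 3),
    "feat": torch.rand(100, 6),
    "batch": torch.zeros(100, dtype=torch.long),
}
val_like = {
    "coord": torch.rand(100, 3) * 1000.0,  # e.g. coordinates left unnormalized
    "feat": torch.rand(100, 6),
    "batch": torch.zeros(100, dtype=torch.long),
}

train_summary = summarize_batch(train_like)
val_summary = summarize_batch(val_like)
for key in train_summary:
    print(key, train_summary[key], val_summary[key])
```

A large gap in coordinate or feature ranges between the two summaries would be worth following up, since large magnitudes are more likely to overflow in reduced precision.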

Yukichiii commented 1 year ago

The NaN values may be caused by half precision. Could you please try running the forward pass in full precision? You can set fp16_mode=0 and use_amp=False.
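To illustrate why half precision can produce NaN even when inputs and weights are clean: float16 cannot represent values above 65504, so larger intermediate values become inf, and a later operation such as inf - inf yields NaN. The sketch below shows that effect and a generic way to force a float32 forward pass in plain PyTorch. It does not use the repository's fp16_mode or use_amp options; forward_fp32 and the toy model are hypothetical.

```python
import torch
import torch.nn as nn

# 90000 exceeds the float16 maximum (65504), so the cast overflows to inf.
overflowed = torch.tensor([90000.0]).to(torch.float16)
# inf - inf is NaN, which then propagates through every later layer.
diff = overflowed.float() - overflowed.float()


def forward_fp32(model: nn.Module, *inputs):
    """Eval-mode forward pass with float32 weights and autocast disabled.

    Note: model.float() converts the module in place.
    """
    model = model.float().eval()
    device_type = next(model.parameters()).device.type
    # Cast floating-point inputs to float32; leave integer tensors untouched.
    cast = [
        t.float() if torch.is_tensor(t) and t.is_floating_point() else t
        for t in inputs
    ]
    with torch.no_grad(), torch.autocast(device_type=device_type, enabled=False):
        return model(*cast)


# Toy stand-in for the real network (hypothetical, not Swin3D).
toy = nn.Linear(3, 2).half()
out = forward_fp32(toy, torch.ones(1, 3).half())
print(overflowed, diff, out.dtype)
```

If the NaN values disappear with a float32 forward pass, the next step would be to locate the layer whose activations exceed the float16 range.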