isaaccorley / a-change-detection-reality-check

Code and experiments for the paper, "A Change Detection Reality Check", Corley et al.
MIT License

"NaN or Inf found in input tensor" #6

Open EnriqueAlbalate opened 6 months ago

EnriqueAlbalate commented 6 months ago

While running the code with the three U-Net approaches proposed in this repository, the model returns tensors that are entirely NaN, which leads to the message in the title, "NaN or Inf found in input tensor", when the program tries to compute the loss. I think the problem could be related to Automatic Mixed Precision (AMP) (right now it is set to 16-bit), but I'm not sure.

Could you help me? What can I do? Thank you very much in advance.
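For reference, this is the kind of check I ran to confirm that the forward pass itself produces NaNs under AMP (a minimal debugging sketch, not code from the repository; the model configuration and input are stand-ins):

```python
import torch
import segmentation_models_pytorch as smp

# Debugging sketch: run one forward pass under AMP and check for NaN/Inf in
# the output before any loss is computed. The model and input are dummies.
model = smp.Unet(encoder_name="resnet50", in_channels=6, classes=1).cuda()
x = torch.rand(1, 6, 256, 256, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)

print("any NaN:", torch.isnan(out).any().item())
print("any Inf:", torch.isinf(out).any().item())
```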

isaaccorley commented 6 months ago

I'm not able to reproduce this. What versions of torch/lightning/cuda/cudnn/etc are you using?
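One quick way to gather those from the failing environment (a small sketch, not part of the repository):

```python
# Environment report to share when debugging NaN/AMP issues.
import torch
import lightning
import torchvision
import segmentation_models_pytorch as smp

print("torch:", torch.__version__)
print("torch CUDA build:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("lightning:", lightning.__version__)
print("torchvision:", torchvision.__version__)
print("segmentation_models_pytorch:", smp.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```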

EnriqueAlbalate commented 6 months ago

torchgeo==0.6.0.dev0
kornia==0.7.2
lightning==2.2.2
pandas==2.2.2
tqdm==4.66.2
numpy==1.26.4
matplotlib==3.8.4
pillow==10.3.0
torch==2.1.2
segmentation_models_pytorch==0.3.3
torchmetrics==1.2.0
torchvision==0.16.2
image_bbox_slicer==0.4
einops==0.7.0
timm==0.9.2

My CUDA version is 12.2.

I have been able to run the code successfully by changing the precision to "32-true" instead of "16-mixed" (I read that the latter is the default precision in the Trainer script). I don't know if this could affect the results.
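For context, the precision is just an argument to the Lightning Trainer, so the switch looks roughly like this (a sketch; the repo's actual Trainer construction may differ):

```python
from lightning import Trainer

# "16-mixed" enables automatic mixed precision (AMP); "32-true" runs fully in
# float32, which avoids the half-precision overflows that can produce NaNs.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    precision="32-true",  # was "16-mixed"
)
```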

Also, I have checked that the process only needs about 800 MB even after increasing the batch size, so I think the code just trains with the images one by one. Can you confirm that?

Thanks for your response

isaaccorley commented 6 months ago

It might be that you're using CUDA 12.2; I'm using 11.8. Make sure you install a PyTorch build that matches CUDA 12.2. They have instructions for this on their website. The batch size defaults to 8, and the train script has a --batch_size argument you can adjust (see the sketch below).
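To illustrate the point: a --batch_size flag only takes effect if the parsed value is actually forwarded to the DataLoader (or DataModule); otherwise the effective batch size silently stays at its default. A hypothetical, self-contained sketch (not the repository's train script):

```python
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sketch -- not the repository's train script.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=8)
args = parser.parse_args()

# Dummy bitemporal-style dataset just to demonstrate the flag's effect.
dummy = TensorDataset(torch.rand(32, 6, 256, 256), torch.zeros(32, 1, 256, 256))
loader = DataLoader(dummy, batch_size=args.batch_size, shuffle=True)

images, masks = next(iter(loader))
print("batch shape:", images.shape)  # first dimension should equal --batch_size
```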

EnriqueAlbalate commented 6 months ago

Although I change the batch size with the command-line parameter you mentioned, it seems like the model processes the 7120 training images one by one (I'm using LEVIR-CD as my dataset in this case). I have printed the batch tensor shape and its batch dimension is 1.

However, I will try to align all the package versions if I find that the results were affected by changing the Trainer precision parameter.

isaaccorley commented 6 months ago

I'm not able to reproduce this either. Where are you printing the batch?

EnriqueAlbalate commented 6 months ago

Inside the training_step() method in the change_detection.py script.
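For context, a print like that inside a LightningModule's training_step looks roughly as follows (a minimal sketch; the actual structure and batch keys of change_detection.py are assumptions):

```python
import torch
import lightning as L

class ChangeDetectionTask(L.LightningModule):
    """Minimal sketch -- not the repository's actual change_detection.py module."""

    def __init__(self, model):
        super().__init__()
        self.model = model
        self.loss_fn = torch.nn.BCEWithLogitsLoss()

    def training_step(self, batch, batch_idx):
        # The "image1"/"image2"/"mask" batch keys are assumptions for illustration.
        x1, x2, y = batch["image1"], batch["image2"], batch["mask"]
        print("batch shapes:", x1.shape, x2.shape, y.shape)  # first dim = effective batch size
        out = self.model(torch.cat([x1, x2], dim=1))
        return self.loss_fn(out, y.float())

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```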