snap-stanford / SATURN


ValueError: cannot convert float NaN to integer in loss_string += [f"Avg Loss {species}: {round(np.average(epoch_ave_losses[species]))}"] #68

Closed. micheladallangelo closed this issue 1 month ago.

micheladallangelo commented 3 months ago

Hi Yanay,

I'm running the tutorial /Vignettes/frog_zebrafish_embryogenesis/Train SATURN.ipynb after successfully running the dataloader notebook.

However, when I run the training command I get this error:

Pretraining...   0%|          | 0/200 [01:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1072, in <module>
    trainer(args)
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 654, in trainer
    pretrain_model = pretrain_saturn(pretrain_model, pretrain_loader, optim_pretrain,
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 258, in pretrain_saturn
    loss_string += [f"Avg Loss {species}: {round(np.average(epoch_ave_losses[species]))}"]
ValueError: cannot convert float NaN to integer

I changed the path to the protein embeddings and the torch.device line in the script train-saturn.py (I replaced cuda with mps since I have a Mac). Besides this, I didn't make any other changes.
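For context, the device selection I ended up with looks roughly like this (just a sketch; the exact line in train-saturn.py differs):

```python
import torch

# Prefer CUDA when available, then Apple's MPS backend, otherwise fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```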

I tried to debug it by adding a line of code that prints the loss values for each species at every epoch, before the average is calculated: print(f"Epoch {epoch} - {species} losses: {epoch_ave_losses[species]}").
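For what it's worth, a more compact check would be to count the NaN entries instead of printing the whole array (just a debugging sketch using the same variables, not part of the original script):

```python
import numpy as np

# Count how many per-batch losses for this species are NaN in the current epoch.
losses = np.asarray(epoch_ave_losses[species])
n_nan = int(np.isnan(losses).sum())
print(f"Epoch {epoch} - {species}: {n_nan}/{losses.size} losses are NaN")
```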

What I get is:

Pretraining...   0%|          | 0/10 [00:00<?, ?it/s]
Epoch 1 - frog losses: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
Epoch 1 - zebrafish losses: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
Epoch 1: L1 Loss 0.0 Rank Loss 9.588663101196289:  10%| | 1/10 [01:03<09:33, 63.^C

There are a lot of NaN values in all the following epochs too, and I don't know whether that is normal. On the other hand, the Rank Loss is always a number, and a different one each time. So I was wondering why the average loss per epoch (epoch_ave_losses) is calculated at all, and whether I can skip this line of code.

I tried to use np.nan_to_num to replace the NaNs with zero before calculating the average:

clean_losses = np.nan_to_num(epoch_ave_losses[species], nan=0.0)
loss_string += [f"Avg Loss {species}: {round(np.average(clean_losses))}"]
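Another option I considered (again only a sketch, and it just hides the symptom rather than fixing whatever produces the NaNs) is to ignore the NaN entries instead of zeroing them:

```python
import numpy as np

losses = np.asarray(epoch_ave_losses[species])
if np.all(np.isnan(losses)):
    # Nothing to average; report it explicitly instead of crashing in round().
    loss_string += [f"Avg Loss {species}: NaN"]
else:
    # np.nanmean ignores NaN entries instead of counting them as zero.
    loss_string += [f"Avg Loss {species}: {round(float(np.nanmean(losses)))}"]
```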

The code then runs, but I end up with another error at the STARTING METRIC TRAINING step:

Traceback (most recent call last):
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1079, in <module>
    trainer(args)
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 799, in trainer
    train(metric_model, loss_func, mining_func, device,
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 131, in train
    loss.backward()
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'NativeLayerNormBackward0' returned nan values in its 0th output.

I imagine all of this is due to the NaN values?
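In case it helps to localize where the NaNs first appear, I was also planning to enable autograd anomaly detection (a sketch; I understand it slows training down considerably):

```python
import torch

# With anomaly detection on, loss.backward() raises an error that includes a
# traceback of the forward operation that produced the NaN, not just the
# failing backward node (here NativeLayerNormBackward0).
torch.autograd.set_detect_anomaly(True)
```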

I hope the problem is clear. Before asking, I went through all the open issues to see whether anyone had already raised a similar one, but I didn't find any.

Thank you, Michela

Yanay1 commented 3 months ago

Hi Michela,

There shouldn't be any NaNs during training; it's possible that the change to MPS caused this, but I'm not sure.

Any kind of NaN during training is really concerning and will probably completely throw off the model.
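One quick sanity check (just a sketch, I haven't reproduced this on an MPS machine myself) would be to force the device to CPU and see whether the NaNs disappear; if they do, the MPS backend is the likely culprit:

```python
import torch

# Temporarily force CPU to rule out the MPS backend as the source of the NaNs.
device = torch.device("cpu")
```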

micheladallangelo commented 2 months ago

Hi Yanay,

Thank you for answering. I ran the model on CPU and the epoch losses look fine; there are no NaN values anymore. So you were right, the problem was the change to MPS, even though I don't understand the reason behind it :)

If it's too slow, I will run it on the cluster where we have CUDA installed!