Hi Michela,
There shouldn't be any NaNs during training. It's possible that the change to MPS caused this, but I am not sure.
Any kind of NaN during training is really concerning and will probably completely throw off the model.
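If you want to pin down where the NaNs first show up, a generic check like the one below can help. This is only a rough sketch of a toy PyTorch loop, not SATURN's actual training code, and every name in it is a stand-in:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins; replace with the real model, loss, and loader.
device = torch.device("cpu")
model = nn.Linear(16, 4).to(device)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(torch.randn(64, 16),
                                  torch.randint(0, 4, (64,))), batch_size=8)

# Stop at the first batch whose loss turns NaN and report whether the
# inputs themselves already contained NaNs.
for batch_idx, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    loss = loss_fn(model(x), y)
    if torch.isnan(loss).any():
        print(f"first NaN loss at batch {batch_idx}")
        print("NaN already in inputs:", torch.isnan(x).any().item())
        break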
Hi Yanay,
Thank you for answering. I ran the model on CPU and the epoch loss is fine; there are no NaN values anymore. So you were right, the problem was the change to MPS, even though I don't understand the reason behind it :)
If it's too slow, I will run it on the cluster where we have CUDA installed!
Hi Yanay,
I'm running the tutorial /Vignettes/frog_zebrafish_embryogenesis/Train SATURN.ipynb after successfully running the dataloader notebook.
However, when I run the command, I get this error:
Pretraining...   0%|          | 0/200 [01:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1072, in <module>
    trainer(args)
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 654, in trainer
    pretrain_model = pretrain_saturn(pretrain_model, pretrain_loader, optim_pretrain,
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 258, in pretrain_saturn
    loss_string += [f"Avg Loss {species}: {round(np.average(epoch_ave_losses[species]))}"]
ValueError: cannot convert float NaN to integer
I changed the path to the protein embeddings and the torch.device line in the script train-saturn.py (I replaced cuda with mps, since I have a Mac). Besides this I didn't make any other changes.
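To be concrete, the change was only the device line; a more defensive version of the same change would be something like this sketch (not the exact code from train-saturn.py, just an illustration of the selection logic):

import torch

# Prefer CUDA if present, otherwise MPS on Apple Silicon, otherwise CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(device)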
I tried to debug it by adding a line of code that prints the loss values for each epoch before the average is computed:
print(f"Epoch {epoch} - {species} losses: {epoch_ave_losses[species]}")
What I get is:
Pretraining...   0%|          | 0/10 [00:00<?, ?it/s]
Epoch 1 - frog losses: [nan nan nan ... nan]        (40 values, all nan)
Epoch 1 - zebrafish losses: [nan nan nan ... nan]   (40 values, all nan)
Epoch 1: L1 Loss 0.0 Rank Loss 9.588663101196289:  10%| | 1/10 [01:03<09:33, 63.^C
There are a lot of NaN values for all the following epochs as well; I don't know if that is normal. On the other hand, the Rank Loss is always a finite number and changes every epoch. So I was wondering why the average loss per epoch (epoch_ave_losses) is computed at all, and whether I can skip this line of code.
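For example, would it be reasonable to ignore the NaNs when averaging instead of skipping the line entirely? A toy sketch of what I mean (not code from train-saturn.py):

import numpy as np

# Average while ignoring NaNs instead of zeroing them out.
losses = np.array([0.9, np.nan, 1.1, np.nan])   # toy stand-in for epoch_ave_losses[species]
avg = np.nanmean(losses)                         # 1.0 here; NaN (with a warning) if *all* values are NaN
print(f"Avg Loss: {avg:.3f}" if not np.isnan(avg) else "Avg Loss: nan")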
I tried to use np.nan_to_num to replace NaN with zero before calculating the average:
clean_losses = np.nan_to_num(epoch_ave_losses[species], nan=0.0)
loss_string += [f"Avg Loss {species}: {round(np.average(clean_losses))}"]
The code then runs, but I end up with another error at the STARTING METRIC TRAINING step:
Traceback (most recent call last):
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 1079, in <module>
    trainer(args)
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 799, in trainer
    train(metric_model, loss_func, mining_func, device,
  File "/Users/mdallang/unil/workspace/SATURN/Vignettes/frog_zebrafish_embryogenesis/../../train-saturn.py", line 131, in train
    loss.backward()
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/opt/miniconda3/envs/saturn/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'NativeLayerNormBackward0' returned nan values in its 0th output.
I imagine all of this is due to the NaN values?
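If it helps, this is the kind of sanity check I could add right before metric training to confirm that; the tensor here is only a stand-in for whatever train-saturn.py actually passes in at that point:

import torch

# Stand-in for the pretrained embeddings fed into metric training.
embeddings = torch.randn(8, 256)
embeddings[0, 0] = float("nan")   # simulate one corrupted value

# If NaNs are already present here, the LayerNorm backward error is expected.
if torch.isnan(embeddings).any():
    n_bad = int(torch.isnan(embeddings).sum())
    print(f"{n_bad} NaN values found before metric training")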
I hope the problem is clear. I went through all the open issues to see if someone had already raised a similar one before asking, but I didn't find any.
Thank you, Michela