Open · GenVr opened this issue 1 year ago
cc @DachengLi1
@GenVr This is likely because PyTorch FSDP saves the T5 model incorrectly (if you print out the loaded model weights, the encoder or decoder embedding is probably all zeros, which causes the final predictions to be all 0). Can you try using our postprocessing function? There is another issue in this repo about the same problem. Let me know if it works!
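As a quick sanity check (a minimal sketch, not the actual postprocessing function; the checkpoint path is a placeholder), you can load the saved weights and look at the shared embedding:

```python
from transformers import T5ForConditionalGeneration

# Load the checkpoint produced by the FSDP training run (placeholder path).
model = T5ForConditionalGeneration.from_pretrained("path/to/your_checkpoint")

# In T5 the encoder and decoder share one embedding table (model.shared).
# If the FSDP save went wrong, this tensor is all zeros and generation
# degenerates to empty/zero predictions.
emb_sum = model.shared.weight.abs().sum().item()
print(f"abs-sum of shared embedding: {emb_sum}")  # ~0.0 means the save is broken
```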
@DachengLi1 Thanks, I trained on GPUs with more memory and used the function after training, and I am able to load the model correctly. Now I have another problem: during training, both the loss and the learning rate are zero:
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
...
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.32}
...
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.67}
...
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 1.02}
...
It seems that the network isn't learning anything. My configuration is written in the initial post. I trained both on the dummy.json dataset and on a personal one, with the same results. Do you have any idea about it? Thanks.
@GenVr I ran into a similar issue where the learning rate is 0 on a small dataset. This is caused by integer flooring behavior in the Hugging Face Transformers library. Can you try warmup_ratio=0 (or omit the argument) and let me know what happens?
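If you want to see what the scheduler is actually doing, here is a standalone sketch (the step count and warmup ratio are made-up example numbers, not the real training config) that rebuilds a linear-warmup schedule and prints the learning rate per step:

```python
import torch
from transformers import get_linear_schedule_with_warmup

num_training_steps = 30   # e.g. a small dataset with few optimizer steps (made-up number)
warmup_ratio = 0.03
num_warmup_steps = int(warmup_ratio * num_training_steps)  # integer truncation: 0 here
print("warmup steps:", num_warmup_steps)

# Dummy parameter/optimizer just to drive the scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

for step in range(num_training_steps):
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])
```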
@DachengLi1 Thanks. I tried removing --warmup_ratio 0.03 and got this:
...
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.07}
...
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.31}
...
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.52}
...
Now the learning rate is non-zero, but the loss is always zero. With batch size 1, the loss is sometimes non-zero. I also tried changing the learning rate to 1e3, but after the first epoch the situation remains the same.
@GenVr Nice to hear that! Let's keep bs=1 for now; I will look into whether bs>1 can cause other problems (I haven't really tested bs>1 because of the GPU memory limit). Can you try bs=1 on your own dataset? The dummy dataset is composed of very simple questions (if you look into it, a lot of them are very similar), so you probably want to check whether this still happens on a more complex dataset.
BTW, remember to change the preprocessed data path, otherwise it will read from the existing preprocessed file.
@DachengLi1 Thanks. I tried both with batch size equal to 1 and greater than 1, on my personal dataset. The loss is always zero and the network seems to fail to train (the outputs look untrained). Maybe I could try a big public .json dataset to see what happens?
@GenVr Interesting... I haven't seen this before. Could you print an input/target tensor before it goes into the trainer to see what the contents are? Maybe the data is being processed in the wrong way.
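For example, something like this (a sketch assuming the standard Hugging Face Trainer setup, with trainer and tokenizer already built in the training script):

```python
# Grab one batch exactly as the Trainer would see it.
batch = next(iter(trainer.get_train_dataloader()))
print(batch["input_ids"].shape, batch["labels"].shape)

# Decode the first example's input.
print(tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True))

# Labels are usually padded with -100 (ignored by the loss); mask those out
# before decoding. If nearly every label is -100, there is essentially
# nothing for the loss to train on.
labels = batch["labels"][0]
print(tokenizer.decode(labels[labels != -100], skip_special_tokens=True))
```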
Same problem (0 loss from start to finish), with both dummy.json and my own dataset.
Was this resolved? Same problem with a 0 loss.
same problem
same problem
Hi, I'm training a FlanT5 network. The training completes successfully, but when I try to run a simple inference, I get a tensor of zeros, so the prediction is empty.
Example:
Output:
I tried running several trainings, both on flanT5-xl and flanT5-large, and both on my personal dataset and the dummy.json dataset.
This is my training configuration:
Any idea what's going on? Thank you.