ilkyyldz95 opened 3 years ago
Hi @ilkyyldz95
Getting the same labels for all datapoints does seem odd. Did you make any changes to the models? You are right about using `torch.no_grad()` for both the CNN and Attention models, and we have updated the class files to reflect this. Thanks for pointing it out. In fact, finetuning the XLM Roberta model (i.e. updating its weights rather than wrapping its forward pass in `torch.no_grad()`) requires a large amount of GPU memory and can lead to a Memory Exceeded error on the device.
Thank you for verifying this. For consistency, it would help to update your arXiv paper to indicate that XLM Roberta is not finetuned. Is there anything else you changed in the code? I am trying to replicate your results for a project, so this would really help.
Hi @ilkyyldz95 ,
During the competition, finetuning XLMRoberta gave slightly better results than training with frozen weights, which is why we mentioned it in our system description paper. In the repository, we decided to publish the version with frozen weights, as finetuning the large XLMRoberta model frequently runs out of memory on standard GPUs. Could you check whether your earlier experiment (without `torch.no_grad()`) ran into a memory exceeded issue?
Hi,
I personally never ran into memory issues; I am running the code on a cluster with GPUs that have 512 GB of memory. My main problem, as mentioned in the first post, is that finetuning XLM Roberta hinders training and leads to the model predicting only one class.
Hi,
From the source code, it looks like XLM Roberta is finetuned via the gradient updates of the LSTM attention model. However, when I follow the README instructions and train the model on hinglish, finetuning XLM Roberta hinders training and leads to the model predicting only one class. The only setting in which I could train successfully was when I placed the XLM Roberta forward pass inside `torch.no_grad()` for both the CNN and LSTM models.
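For reference, this is the freezing setup I mean. A minimal sketch; a small linear layer stands in for the real XLM Roberta encoder from `transformers` so the example stays self-contained, but the `torch.no_grad()` pattern is the same:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the XLM Roberta encoder (the real one comes
# from the `transformers` library); a tiny linear layer keeps this runnable.
encoder = nn.Linear(8, 8)
classifier = nn.Linear(8, 2)  # stands in for the CNN / LSTM-attention head

def forward(x, freeze_encoder=True):
    if freeze_encoder:
        # Frozen setting: no gradients flow into the encoder, so only
        # the downstream head is trained (and activations for the
        # encoder's backward pass are never stored, saving GPU memory).
        with torch.no_grad():
            feats = encoder(x)
    else:
        # Finetuning setting: encoder weights also receive gradients.
        feats = encoder(x)
    return classifier(feats)

x = torch.randn(4, 8)
forward(x, freeze_encoder=True).sum().backward()
assert encoder.weight.grad is None          # encoder untouched
assert classifier.weight.grad is not None   # head still trains
```

With `freeze_encoder=False` the same backward pass would populate `encoder.weight.grad` as well, which is the finetuning behavior I could not get to train properly.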
Can you please clarify this? Thank you,