keshav22bansal / BAKSA_IITK

Official implementation of the paper "BAKSA at SemEval-2020 Task 9: Bolstering CNN with Self-Attention for Sentiment Analysis of Code Mixed Text", accepted at the Proceedings of the 14th International Workshop on Semantic Evaluation.

Finetuning XLM Roberta causes output saturation #3

Open ilkyyldz95 opened 3 years ago

ilkyyldz95 commented 3 years ago

Hi,

From the source code, it appears that XLM Roberta is finetuned, with gradient updates driven by the LSTM attention model. However, when I follow the README instructions and train the model on Hinglish, finetuning XLM Roberta hinders training and leads to predicting only one class. The only setting in which I could train successfully was placing the XLM Roberta forward pass inside torch.no_grad() for both the CNN and LSTM models.
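For concreteness, here is a minimal sketch of the workaround I am describing, assuming a hypothetical classifier that feeds XLM-R features into a trainable head (class and variable names are illustrative, not the repository's actual ones):

```python
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class FrozenXLMRClassifier(nn.Module):
    """Illustrative model: XLM-R features are extracted without gradients,
    so only the downstream head (e.g. the CNN or LSTM-attention part) trains."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Wrapping the encoder call in torch.no_grad() keeps XLM-R frozen:
        # no activations are stored for backprop and its weights never update.
        with torch.no_grad():
            hidden = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
            ).last_hidden_state
        # Only the head below receives gradient updates.
        return self.head(hidden[:, 0, :])
```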

Can you please clarify this? Thank you,

Harshagarwal19 commented 3 years ago

Hi @ilkyyldz95

Getting the same label for all datapoints is a bit strange. Did you make any changes to the models? You are right about using torch.no_grad for both the CNN and Attention models, and we have updated the class files to reflect this. Thanks for pointing it out. In fact, finetuning the XLM Roberta model (i.e. updating its weights instead of using torch.no_grad) requires a large amount of GPU memory and can lead to an out-of-memory error on the device.

ilkyyldz95 commented 3 years ago

Thank you for verifying this. For consistency, it would be good to update your arXiv paper to indicate that XLM Roberta is not finetuned. Is there anything else you updated in the code? I am trying to replicate your results for a project, so this would really help.

Harshagarwal19 commented 3 years ago

Hi @ilkyyldz95 ,

During the competition, finetuning XLM Roberta gave slightly better results than training with frozen weights, which is why we described that setup in our system description paper. In the repository, we decided to publish the version with frozen weights, since finetuning the large XLM Roberta model frequently runs out of memory on most standard GPUs. Could you check whether your earlier run (without torch.no_grad()) hit an out-of-memory error?
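For anyone comparing the two setups, an alternative way to keep the released configuration's behaviour is to disable gradients on the encoder's parameters so the optimizer only updates the head; this is a sketch under assumed names, not the repository's actual code:

```python
# Sketch: freezing the encoder explicitly so only the head is optimized.
# Names and hyperparameters are illustrative, not taken from the repository.
import torch
from transformers import XLMRobertaModel

encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
for p in encoder.parameters():
    p.requires_grad_(False)  # no gradients or optimizer state kept for XLM-R

head = torch.nn.Linear(encoder.config.hidden_size, 3)

# Passing only parameters that still require grad to the optimizer is where
# most of the memory saving over full finetuning comes from (no Adam moments
# or gradient buffers for the ~270M encoder weights).
trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```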

ilkyyldz95 commented 3 years ago

Hi,

I personally never ran into memory issues. I am running the code on a cluster whose GPUs have 512 GB of memory. My main problem, as described in my first post, was that finetuning XLM Roberta hinders training and leads to predicting only one class.