Unable to adapt language model

Thank you for the great research and code! Regarding this section in the research paper:

"For CoLA (linguistic acceptability), examples are scored as the average token log-probability the generative model assigns and predictions are made by thresholding. For SST-2 (sentiment analysis), we append the token very to each example and restrict the language model’s output distribution to only the words positive and negative and guess the token it assigns higher probability to as the prediction. For RACE (question answering), we pick the answer the generative model assigns the highest average token log-probability when conditioned on the document and question. For DPRD [46] (winograd schemas), we replace the definite pronoun with the two possible referrents and predict the resolution that the generative model assigns higher average token log-probability to the rest of the sequence after the substitution."

I have tried to adapt the language model part of the model in the code to perform the above mentioned tasks. For instance, for CoLA, I fed in the encoded sentences and evaluated the results by thresholding the lm_losses output from the language model. However, the best Matthews correlation coefficient obtained is 0.015, far short of the 0.479 achieved in the paper. How exactly was the model configured to perform the task? Was it purely through the language model output, or was the supervised classification head used?

openai / finetune-transformer-lm

Unable to adapt language model #7