microsoft / MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding https://arxiv.org/pdf/2004.09297.pdf
MIT License

The future is to combine MPNet with other language models innovations #15

Open LifeIsStrange opened 2 years ago

LifeIsStrange commented 2 years ago

For example, it could really make sense to adapt MPNet so that it keeps PLM but uses ELECTRA's approach for MLM. SpanBERT has some potential too (e.g., on coreference resolution). I believe this could really push the state of the art on key tasks.

What do you think? @StillKeepTry @tan-xu

Moreover, there are important low-hanging fruits that transformer researchers have consistently ignored:

The activation function should probably be https://github.com/digantamisra98/Mish, as it is the one that gives the most accuracy gains in general. It can give around 1% accuracy gain, which is huge.
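
A minimal sketch of what that swap looks like, assuming PyTorch >= 1.9 (which ships `torch.nn.Mish`); the `FeedForward` module here is a hypothetical stand-in for a transformer feed-forward block, not MPNet's actual code:

```python
# Sketch: using Mish in place of GELU in a transformer feed-forward block.
# Mish(x) = x * tanh(softplus(x)); available as nn.Mish in PyTorch >= 1.9.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Hypothetical feed-forward block with Mish instead of the usual GELU."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.Mish(),            # drop-in replacement for nn.GELU()
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 16, 768)
print(FeedForward()(x).shape)  # torch.Size([2, 16, 768])
```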

Secondly, the optimizer you're using, Adam, is flawed; you should use its rectified version: https://github.com/LiyuanLucasLiu/RAdam. Moreover, it can optionally be combined with a complementary optimizer: https://github.com/michaelrzhang/lookahead
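
A minimal sketch of RAdam plus Lookahead, assuming PyTorch >= 1.10 (which ships `torch.optim.RAdam`); the Lookahead part below is a simplified re-implementation of the idea (slow weights synced every k steps), not the linked repo's actual API, and the model/loss are dummies:

```python
# Sketch: RAdam as the fast optimizer, with a hand-rolled Lookahead outer loop.
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                       # stand-in for the transformer
fast_opt = torch.optim.RAdam(model.parameters(), lr=1e-4)

k, alpha = 5, 0.5                                 # sync every k steps, interpolate by alpha
slow_weights = [p.detach().clone() for p in model.parameters()]

for step in range(20):
    loss = model(torch.randn(8, 768)).pow(2).mean()   # dummy loss
    fast_opt.zero_grad()
    loss.backward()
    fast_opt.step()                               # fast (RAdam) update
    if (step + 1) % k == 0:
        for p, slow in zip(model.parameters(), slow_weights):
            slow += alpha * (p.detach() - slow)   # slow <- slow + alpha * (fast - slow)
            p.data.copy_(slow)                    # reset fast weights to the slow weights
```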

Moreover, there are newer training techniques that yield significant accuracy gains, such as https://github.com/Yonghongwei/Gradient-Centralization and gradient normalization.
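
Gradient Centralization itself is a one-line change per parameter: subtract the per-output-channel mean from the gradient before the optimizer step. A minimal sketch, assuming it is applied to every parameter with more than one dimension (as in the linked repo's paper):

```python
# Sketch: Gradient Centralization applied right before optimizer.step().
import torch

def centralize_gradients(model):
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:
            # mean over all dims except dim 0 (the output dimension)
            mean = p.grad.mean(dim=tuple(range(1, p.grad.dim())), keepdim=True)
            p.grad.sub_(mean)

# usage inside the training loop:
#   loss.backward()
#   centralize_gradients(model)
#   optimizer.step()
```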

There is a library that integrates all of those advances and more: https://github.com/lessw2020/Ranger21
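
A hedged usage sketch, assuming the `ranger21` package exposes a `Ranger21` optimizer whose constructor takes `num_epochs` and `num_batches_per_epoch` (as its README showed at the time of writing; check the repo for the current signature). The model and the numbers below are placeholders:

```python
# Sketch: swapping the training optimizer for Ranger21 (signature assumed from the repo README).
import torch.nn as nn
from ranger21 import Ranger21

model = nn.Linear(768, 768)     # stand-in for the transformer
optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    num_epochs=10,              # Ranger21 schedules warm-up/warm-down internally
    num_batches_per_epoch=1000,
)
```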

Accuracy gains in NLP/NLU have reached a plateau. The reason is that researchers work far too much in isolation. They bring N new innovations per year, but the number of researchers who attempt to use those innovations/optimizations together can be counted on the fingers of one hand.

XLNet has been consistently ignored by researchers; you are the ones who saw the opportunity to combine the best of both worlds of BERT and XLNet. Why stop there? As I said, both on the transformer/language-model side and on the activation-function/optimizer side there are a LOT of significant accuracy optimizations to integrate into the successor of MPNet. Aggregating those optimizations could yield a revolutionary language model with 5-10% average accuracy gains over the existing SOTA. It would mark history. No one else will attempt to combine a wide range of those innovations; you are the only hope. If you do not do it, I'm afraid no one else will, and NLU will stagnate for the decade to come.