Since BERT is based on the Transformer architecture, is there any reason to use BERT embeddings for an NMT model that is already a Transformer?
My take is that, because BERT embeddings are trained on a very large corpus, they may bring better information than embeddings trained jointly with my NMT model on my small parallel corpus.
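To make the idea concrete, here is a minimal sketch of what I mean by "using BERT embeddings": a frozen multilingual BERT produces contextual source-side embeddings, which would then be projected to the NMT model's hidden size and fed to its encoder. The projection layer and the overall wiring are hypothetical, just to illustrate the question; only the Hugging Face calls are standard.

```python
import torch
from transformers import BertModel, BertTokenizer

# Frozen BERT used as a source-side embedding layer for the NMT Transformer.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")
bert.eval()  # keep BERT frozen; only the NMT model trains on the parallel corpus

src_sentence = "Das ist ein Beispielsatz."
inputs = tokenizer(src_sentence, return_tensors="pt")

with torch.no_grad():
    # (1, seq_len, 768) contextual embeddings, instead of embeddings
    # learned from scratch on the small parallel corpus
    src_embeddings = bert(**inputs).last_hidden_state

# Hypothetical next step: project to the NMT model's hidden size d_model
# and pass the result to its encoder.
d_model = 512
proj = torch.nn.Linear(bert.config.hidden_size, d_model)
nmt_encoder_input = proj(src_embeddings)
```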