moussaKam / BARThez

A French sequence-to-sequence pretrained model
Apache License 2.0

mBARThez training #7

Open SimonBenhamou opened 4 months ago

SimonBenhamou commented 4 months ago

Hello @moussaKam,

I can't find the code in the repository that was used to continue mBART pretraining to create mBARThez. Did you make it available somewhere?

More specifically, I'm interested in understanding how you adapted the mBART tokenizer. It looks like the checkpoint on Hugging Face uses the BARThez tokenizer, not the mBART tokenizer. So my question is: how did you align the pretrained mBART embeddings with the BARThez tokenizer vocabulary?
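
To make the question concrete, here is a minimal sketch of the kind of alignment I have in mind. It is only a guess, not something taken from the repository: rows are copied for tokens that appear in both vocabularies, and the remaining tokens are randomly initialised. The function name `remap_embeddings`, the toy vocabularies, and the 0.02 standard deviation are all hypothetical and only there for illustration.

```python
import random


def remap_embeddings(old_vocab, old_emb, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab from one trained on old_vocab.

    old_vocab / new_vocab: dict mapping token string -> row index
    (new_vocab indices are assumed to be contiguous, 0..len-1).
    old_emb: list of rows (lists of floats), indexed by old_vocab ids.
    Tokens found in both vocabularies keep their trained row; all other
    tokens get a small random initialisation (an assumption, not a known
    detail of how mBARThez was built).
    """
    rng = random.Random(seed)
    new_emb = [None] * len(new_vocab)
    shared = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Shared token: reuse the pretrained vector.
            new_emb[new_id] = list(old_emb[old_id])
            shared += 1
        else:
            # Token unknown to the old vocabulary: random init.
            new_emb[new_id] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_emb, shared


# Toy vocabularies standing in for the mBART and BARThez SentencePiece vocabs.
old_vocab = {"<s>": 0, "▁le": 1, "▁chat": 2, "▁the": 3}
old_emb = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
new_vocab = {"<s>": 0, "▁chat": 1, "▁le": 2, "▁bonjour": 3}

new_emb, shared = remap_embeddings(old_vocab, old_emb, new_vocab, dim=2)
print(f"{shared} of {len(new_vocab)} tokens reuse a pretrained row")
```

Is this roughly what was done, or did you use a different strategy for tokens that are missing from the mBART vocabulary?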