microsoft / DeBERTa

The implementation of DeBERTa
MIT License

How to pretrain mDeBERTa? #93

Open StephennFernandes opened 2 years ago

StephennFernandes commented 2 years ago

How do I pretrain mDeBERTa base and small on a custom dataset?

How should the multilingual dataset be structured?

I am planning to pretrain mDeBERTa specifically on multiple Indian languages, but I don't have any proper references or resources on how to start pretraining it.

How should one structure the data? Should the data be shuffled randomly across all the languages, so that one paragraph is Hindi while the next is Marathi? Or should the data be grouped into sequential per-language batches?

Can I use Hugging Face Transformers to pretrain mDeBERTa from scratch?

stefan-it commented 2 years ago

Hey @StephennFernandes ,

yeah, we are all waiting for the code release (at least for v3 with RTD), e.g. see #71.

It is currently not possible to pre-train it with Transformers, because you would need a special RTD implementation, which is missing in Transformers.

StephennFernandes commented 2 years ago

Hey @stefan-it, is the v1 or v2 pretraining script not available yet either?

stefan-it commented 2 years ago

v1 and v2 should be supported by running this pretraining script:

https://github.com/microsoft/DeBERTa/blob/master/experiments/language_model/mlm.sh

StephennFernandes commented 2 years ago

@stefan-it would the same script also work for multilingual DeBERTa?

stefan-it commented 2 years ago

It should work, because you need to pass the vocab (spm.model) file as a parameter. So you could just use the mDeBERTa one (from here: https://huggingface.co/microsoft/mdeberta-v3-base/tree/main), or you can train your own SPM model and pass it to the pre-training script :)
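For the second option, a minimal SentencePiece training sketch might look like this (the file name, vocab size, and other parameters below are illustrative choices, not values taken from the DeBERTa repo):

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text multilingual corpus
# (one sentence or paragraph per line, all languages mixed together).
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",  # illustrative file name
    model_prefix="spm",               # produces spm.model and spm.vocab
    vocab_size=250_000,               # mDeBERTa-v3 ships a 250k-token vocabulary
    model_type="unigram",
    character_coverage=0.9995,        # keep rare characters from Indic scripts
    input_sentence_size=10_000_000,   # subsample very large corpora
    shuffle_input_sentence=True,
)
```

The resulting spm.model is then what you would pass to the pre-training script as the vocab file.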

StephennFernandes commented 2 years ago

@stefan-it thanks a ton. By the way, how should I arrange the multilingual corpus? I see in the mlm.sh script that the model takes in train.txt, valid.txt, and test.txt as three files, but in my case of multilingual pretraining, how should I arrange the text of the multiple languages?

Also, how do I separate my corpus into train, test, and valid sets? Is there a certain distribution to follow? How should I split the corpus while maintaining a uniform contextual distribution so that the test and valid sets aren't skewed?

stefan-it commented 2 years ago

Hi @StephennFernandes, good question. I'm not sure the validation corpus is ever well explained in multilingual LM papers, but I would start by using approximately the same number of sentences for each language. Technically, you could use the upsampling/downsampling approach that is mentioned, e.g., in the XLM paper (https://arxiv.org/abs/1901.07291, section 3.1).
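To make the up/downsampling concrete, the XLM recipe smooths the raw language frequencies with an exponent alpha < 1, so low-resource languages are sampled more often than their raw share. A small sketch (the language codes and sentence counts are made-up example numbers):

```python
# Exponentially smoothed language sampling as in the XLM paper
# (https://arxiv.org/abs/1901.07291, section 3.1): sample language i with
# probability q_i proportional to p_i**alpha, where p_i is its raw share of
# sentences. alpha < 1 upsamples low-resource languages.
counts = {"hi": 5_000_000, "mr": 800_000, "ta": 1_200_000}  # example numbers
alpha = 0.5  # the value used for multilingual MLM in the XLM paper

total = sum(counts.values())
p = {lang: n / total for lang, n in counts.items()}
z = sum(v ** alpha for v in p.values())
q = {lang: (v ** alpha) / z for lang, v in p.items()}

print(q)  # Marathi and Tamil get a larger share than their raw fractions
```

You would then fill train.txt by drawing lines from each language's file according to q, shuffled together rather than concatenated language by language.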

But I think it is more important to do downstream task evaluations on different checkpoints to select the "best" model.

StephennFernandes commented 2 years ago

Hey, following up on that question: do multilingual language models need a parallel text corpus, i.e., the same content in parallel across all languages? I have read that XLM has a training objective that requires parallel text across languages, but I haven't seen this in mT5 or mBART, or specifically mentioned in other multilingual papers. Usually, most of the multilingual corpora available don't have any parallel relationship.

stefan-it commented 2 years ago

I would say yes; this paper also shows an improvement for sequence labeling tasks such as NER when using parallel data and mT5:

https://arxiv.org/abs/2106.02171 :)

StephennFernandes commented 2 years ago

Ohh, okay, thanks a ton for that @stefan-it. But what if I really don't have any parallel corpus? What if I have a multilingual corpus with independent content in every language, could I still pretrain models like mT5, mDeBERTa, mBART, etc.?

Can I still pretrain them with such a corpus?

pepi99 commented 2 years ago

What would be the difference between pre-training DeBERTa on an entirely new dataset and fine-tuning the pre-trained DeBERTa with MLM?

StephennFernandes commented 2 years ago

@stefan-it any update on when the DeBERTa V3 pretraining script will be available?

StephennFernandes commented 2 years ago

@stefan-it any update on the DeBERTaV3 pretraining script?

stefan-it commented 2 years ago

Hi @StephennFernandes , I'm not a DeBERTa maintainer, so I'm also waiting for the code release 😅

WissamAntoun commented 2 years ago

@stefan-it I got all the pretraining code working for DeBERTaV3 except the gradient-disentangled embedding sharing part, which I guess is one of the main contributions of DeBERTaV3. Do you have any idea how to implement it?
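From the paper's description, my rough (and untested) reading is that the discriminator re-uses the generator's token embedding through a stop-gradient, plus a zero-initialized residual embedding that only the discriminator's RTD loss updates. A minimal PyTorch sketch of that reading (all names here are made up, not taken from the DeBERTa codebase):

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Sketch of gradient-disentangled embedding sharing as I read it from the
    DeBERTaV3 paper: the generator embedding is shared with the discriminator,
    but detached so RTD gradients cannot flow back into the generator; a
    zero-initialized residual ("delta") embedding absorbs the discriminator's
    own updates."""

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        self.generator_embedding = generator_embedding
        self.delta = nn.Embedding(
            generator_embedding.num_embeddings,
            generator_embedding.embedding_dim,
        )
        nn.init.zeros_(self.delta.weight)  # start from plain embedding sharing

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        shared = self.generator_embedding(input_ids).detach()  # stop-gradient
        return shared + self.delta(input_ids)
```

After training, the discriminator's effective embedding would then be the sum of the generator embedding and the residual, but I'm not sure this matches the official implementation.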

stefan-it commented 2 years ago

Hi @WissamAntoun, this sounds really interesting! I would definitely be interested in that (I did some preliminary tests with pretraining a v2 model, but training got stuck in a multi-GPU environment; single GPU works fine).

However, I did some searching over the current DeBERTa codebase but wasn't able to find any embedding sharing implementation for this gradient-disentangled approach. I also found an ELECTRA-DeBERTa implementation (https://github.com/smallbenchnlp/ELECTRA-DeBERTa) from the Small-Bench NLP paper (https://arxiv.org/abs/2109.10847), but it seems they're also using v2 of DeBERTa. So I don't have any clue how to implement it, unfortunately!

StephennFernandes commented 1 year ago

Gently pinging the contributors @anukaal @nakosung @namisan @alisafaya ... hey guys, could you please provide the DeBERTaV3 and mDeBERTaV3 pretraining code with all the pretraining steps and docs?

StephennFernandes commented 1 year ago

Just checking if there is any update here?

@anukaal @alisafaya @nakosung @namisan Hey guys, please help us out here ... a ton of us are eagerly waiting for the DeBERTaV3 pretraining release.

StephennFernandes commented 1 year ago

Any update on the DeBERTa V3 pretraining code being released?

StephennFernandes commented 1 year ago

Any update here on the official DeBERTa V3 pretraining code?

StephennFernandes commented 1 year ago

@stefan-it, good news! The DeBERTa V3 pretraining code has finally been released.

WissamAntoun commented 1 year ago

We have released our re-implementation of the DeBERTa pretraining code, which we used to train CamemBERTa, a French LM: https://gitlab.inria.fr/almanach/CamemBERTa

BartWesthoff commented 11 months ago

Does this include V3?

WissamAntoun commented 11 months ago

@BartWesthoff Yes