google-deepmind / deepmind-research

This repository contains implementations and illustrative code to accompany DeepMind publications
Apache License 2.0

Training scripts for masked LM #263

Open friesel opened 3 years ago

friesel commented 3 years ago

Do you intend to publish the training scripts for the masked LM as well?

diegolascasas commented 3 years ago

Hi, can you specify which project you're directing your question to?

friesel commented 3 years ago

Sorry, my question is directed at the Perceiver IO project team.

In the NLP world, pretrained models are often English-only or cover "all the world's languages". Many users, however, need inference in a specific non-English language and have 1 or 2 GPUs rather than TPU pods, so for them it's most efficient to pretrain only in the language they actually need inference in. For both pretraining and finetuning, it would therefore be great to have the scripts you used to pretrain the masked LM available.

Thx

fding commented 3 years ago

Hi, thanks for your interest in Perceiver IO. We do not plan to open-source the training scripts for the masked LM, because they are heavily tied to our internal infrastructure for training these models at scale. We have, however, released an example training pipeline for ImageNet, as well as the exact configuration we used for language modeling from bytes (in the language modeling colab), which should be useful if you wish to train a new language model from scratch for other languages.

Do let us know if you have any further questions or if you encounter any issues trying to replicate our work!
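
For anyone adapting the ImageNet pipeline to the MLM setting, here is a minimal sketch of what BERT-style byte-level masking and a masked-token cross-entropy loss might look like in JAX. This is not the released pipeline: the vocabulary size, `[MASK]` token id, and mask rate below are illustrative assumptions, and the stand-in "model" is random logits; the exact configuration used by the authors is in the language modeling colab.

```python
# A minimal sketch of byte-level MLM masking and loss in JAX.
# NOT the released pipeline: vocab size, mask token id, and mask
# rate are placeholder assumptions for illustration only.
import jax
import jax.numpy as jnp

VOCAB_SIZE = 262   # 256 byte values + a few special tokens (assumption)
MASK_TOKEN = 256   # hypothetical id for [MASK]
MASK_RATE = 0.15   # standard BERT-style masking rate

def mask_bytes(rng, tokens):
    """Randomly replace ~15% of byte tokens with [MASK]."""
    mask = jax.random.bernoulli(rng, MASK_RATE, tokens.shape)
    masked_tokens = jnp.where(mask, MASK_TOKEN, tokens)
    return masked_tokens, mask

def mlm_loss(logits, targets, mask):
    """Mean cross-entropy over masked positions only."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return jnp.sum(nll * mask) / jnp.maximum(jnp.sum(mask), 1)

# Usage with random logits standing in for a real model's output.
k_data, k_mask, k_logits = jax.random.split(jax.random.PRNGKey(0), 3)
tokens = jax.random.randint(k_data, (2, 128), 0, 256)   # raw bytes
masked, mask = mask_bytes(k_mask, tokens)
logits = jax.random.normal(k_logits, (2, 128, VOCAB_SIZE))
print(mlm_loss(logits, tokens, mask))
```

In a real run, `masked` would be fed to the Perceiver IO model in place of `tokens`, and `mlm_loss` would be differentiated with `jax.grad` inside the training step.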

friesel commented 3 years ago

Thx for the orientation. I'll get my head around the ImageNet pipeline and try to adapt it to the NLP case.

codedecde commented 3 years ago

Hi @fding, would it be possible to share some of the TensorBoard logs for the byte-level LM pretraining, and/or specifics on what final MLM loss the models converge to (something similar to https://github.com/google-research/electra/issues/3)? I am trying to replicate the byte-level experiments, so these logs would be really useful as a reference. Thank you!