mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Investigate distillation quality gap #231

Open eu9ene opened 10 months ago

eu9ene commented 10 months ago

After training en-hu we noticed a somewhat larger quality gap, around 4 BLEU points, between the teacher and student models.

It’s 24.8 BLEU for the quantized and fine-tuned student vs 30.2 BLEU for the teacher ensemble on the flores-test dataset. For comparison, for en-nl we had 27.0 vs 28.3.

The training looks more or less normal, but a little less smooth than usual.

en-hu:

[screenshot of the en-hu training curves, 2023-10-24]

en-nl:

[screenshot of the en-nl training curves, 2023-10-24]

We had a larger and probably higher-quality dataset for en-nl.

We should investigate further whether it’s a pipeline issue, a config issue or a data issue.

eu9ene commented 1 month ago

The only idea I have at the moment is to try utilizing more monolingual data that the model hasn't seen. For example, we have a pretty small distillation gap for da-en. The training corpus was 161M sentences before filtering, and we added 139M monolingual sentences from more diverse sources compared to the regular news-crawl that we use for en-xx pairs:

  # The monolingual data contains:
  #   ~139,436,127 sentences
  mono-src:
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_1.txt.zst  # 65,099,327 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_2.txt.zst # 16,579,852 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-nllb/nllb-mono-da.txt.zst # 57,756,948 sentences

For en-lt we had 76M sentences in the original training corpus and the regular 200M sentences of English news-crawl:

  # The monolingual data contains:
  #   ~195,823,002 sentences
  mono-src:
  - news-crawl_news.2007  #           ~1,557,522 sentences (176M)
  - news-crawl_news.2008 #           ~5,389,380 sentences (609M)
  - news-crawl_news.2009 #           ~6,557,522 sentences (741M)
  - news-crawl_news.2010 #           ~3,247,787 sentences (367M)
  - news-crawl_news.2011 #           ~6,318,584 sentences (714M)
  - news-crawl_news.2012 #           ~6,407,079 sentences (724M)
  - news-crawl_news.2013 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2014 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2015 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2016 #           ~7,982,300 sentences (902M)
  - news-crawl_news.2017 #          ~11,504,424 sentences (1.3G)
  - news-crawl_news.2018 #           ~7,920,353 sentences (895M)
  - news-crawl_news.2019 #          ~17,699,115 sentences (2.0G)
  - news-crawl_news.2020 #          ~22,123,893 sentences (2.5G)
  - news-crawl_news.2021 #          ~21,238,938 sentences (2.4G)
  - news-crawl_news.2022 #          ~23,008,849 sentences (2.6G)
  - news-crawl_news.2023 #          ~23,008,849 sentences (2.6G)

So we could mine some monolingual data for English from HPLT and NLLB. Then we can increase and diversify the mono set used for distillation.

See the paper "From Research to Production and Back: Ludicrously Fast Neural Machine Translation":

2.1 Knowledge distillation with noisy backward-forward translation: "In our experience, student training benefits from forward-translated data that was not seen during teacher training. Since we do not have access to additional monolingual source data, we generate noisy back-translated sentences (Edunov et al., 2018), one set per inverse teacher model."
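
For illustration, here's a minimal sketch of the kind of sentence noise described in Edunov et al. (2018): word dropout, blanking with a filler token, and a small local shuffle. The probabilities and names below are assumptions for the example, not the paper's or our pipeline's actual settings.

  import random

  # Illustrative noise settings; these exact values are assumptions.
  DROP_PROB = 0.1       # probability of deleting a word
  BLANK_PROB = 0.1      # probability of replacing a word with a filler token
  SHUFFLE_DISTANCE = 3  # how far a word may move during the local shuffle

  def noise_sentence(sentence: str, rng: random.Random) -> str:
      words = sentence.split()
      kept = []
      for word in words:
          r = rng.random()
          if r < DROP_PROB:
              continue  # drop the word
          kept.append("<blank>" if r < DROP_PROB + BLANK_PROB else word)
      # Local shuffle: sort positions perturbed by a small random offset.
      offsets = [i + rng.uniform(0, SHUFFLE_DISTANCE) for i in range(len(kept))]
      shuffled = [word for _, word in sorted(zip(offsets, kept))]
      return " ".join(shuffled) if shuffled else sentence

  if __name__ == "__main__":
      rng = random.Random(0)
      print(noise_sentence("the quick brown fox jumps over the lazy dog", rng))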

@marco-c @gregtatum FYI

eu9ene commented 1 month ago

@gregtatum let's prepare NLLB and HPLT data for English in the same way we did for other languages. Then I'll rerun distillation with the more diverse dataset to check the hypothesis. The monolingual shards shouldn't be too big; ideally no bigger than 50M sentences, so that we can add as many as needed to get a good mix.
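
For illustration, a rough sketch of how such shards could be produced from a .txt.zst file. The 50M limit comes from this comment; the file names and the zstandard streaming approach are assumptions, not the actual importer code.

  import io
  import zstandard as zstd  # pip install zstandard

  # Per-shard limit suggested above.
  MAX_SENTENCES = 50_000_000

  def split_mono(input_path: str, output_prefix: str) -> None:
      """Split a .txt.zst monolingual file into plain-text shards of at most MAX_SENTENCES lines."""
      dctx = zstd.ZstdDecompressor()
      shard_idx, line_count = 1, 0
      out = open(f"{output_prefix}_{shard_idx}.txt", "w", encoding="utf-8")
      with open(input_path, "rb") as fh:
          reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
          for line in reader:
              if line_count >= MAX_SENTENCES:
                  out.close()
                  shard_idx, line_count = shard_idx + 1, 0
                  out = open(f"{output_prefix}_{shard_idx}.txt", "w", encoding="utf-8")
              out.write(line)
              line_count += 1
      out.close()

  # Example with a hypothetical file name:
  # split_mono("hplt_filtered_en.txt.zst", "hplt_filtered_en")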

eu9ene commented 1 month ago

Another hypothesis is that our on-the-fly data augmentation affects quality more for weaker teacher models. The experiment here would be to disable the OpusTrainer augmentations and see how the student performs. That's easier to check, but the fix would be very complex, as we would need to move the augmentations from training to the step that prepares the corpus for translation.
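
For reference, a minimal sketch of what moving the case-related augmentations to corpus preparation could look like: augment the source side before teacher decoding, so the student sees augmented source paired with the teacher's translation of it. The modifiers and probabilities are assumptions modeled loosely on OpusTrainer's UpperCase/TitleCase modifiers, not our actual config.

  import random

  # Illustrative probabilities; these values are assumptions, not our settings.
  UPPERCASE_PROB = 0.05
  TITLECASE_PROB = 0.05

  def augment_for_translation(lines, seed: int = 0):
      """Yield source sentences with case augmentations applied up front,
      before they are sent to the teacher for forward translation."""
      rng = random.Random(seed)
      for line in lines:
          r = rng.random()
          if r < UPPERCASE_PROB:
              yield line.upper()
          elif r < UPPERCASE_PROB + TITLECASE_PROB:
              yield line.title()
          else:
              yield line

  if __name__ == "__main__":
      sample = ["this is a test sentence.", "another sentence for the teacher."]
      print(list(augment_for_translation(sample)))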

gregtatum commented 1 month ago

I created 3 separate issues for different lines of investigation.

gregtatum commented 1 month ago

I did some light analysis of our recent runs, comparing their distillation gap against the sentence counts.

https://docs.google.com/spreadsheets/d/1l459Ui9J7ccdP6UMd1qDy51L8Uar2aZWbYWxOGcQqXA/edit?gid=1859623642#gid=1859623642

Data Source            Correlation
All monolingual data    0.331
Newscrawl              -0.215
HPLT                    0.295
NLLB                    0.385
HPLT+NLLB               0.421

The number of data points is pretty low, but it's an early signal that HPLT+NLLB contribute more to better distillation, and that more monolingual data is related to a smaller distillation gap.
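
Correlations like these can be computed with numpy's corrcoef (Pearson); the arrays below are placeholders, not the actual per-run numbers from the spreadsheet.

  import numpy as np

  # Placeholder values; the real per-run numbers are in the linked spreadsheet.
  distillation_gap = np.array([-0.011, -0.025, -0.023, -0.048, -0.057])
  hplt_nllb_sentences = np.array([139e6, 80e6, 95e6, 20e6, 10e6])

  # Pearson correlation between the gap and the HPLT+NLLB sentence count.
  corr = np.corrcoef(distillation_gap, hplt_nllb_sentences)[0, 1]
  print(f"correlation: {corr:.3f}")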

gregtatum commented 1 month ago

#790: Here's another idea on applying fluency scoring, similar to HPLT, to our translations.

gregtatum commented 2 weeks ago

Data Type         COMET Teacher  COMET Student
Parallel          0.782          0.433
Backtranslations  0.095
Distillation                     0.581
Total Data        0.645          0.647

I took another look at correlations on our most recent runs, excluding the English monolingual data from the analysis, since it was the same dataset across runs and so wouldn't show any correlation.

My interpretation here is that parallel data is still the best, followed by distillation data, while back-translation size only weakly affects COMET quality. I'd be curious whether back-translations would correlate more strongly with fluency scores than with overall translation quality.
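
For example, one rough way to get a fluency signal would be to score the back-translated side with an n-gram language model and look at average perplexity, similar in spirit to HPLT's fluency filtering. A sketch assuming a KenLM model is available (the model path is hypothetical):

  import kenlm  # pip install kenlm

  # Hypothetical model path; in practice this would be a language model for
  # the relevant side, e.g. the kind HPLT uses for fluency filtering.
  model = kenlm.Model("en.arpa.bin")

  def mean_perplexity(sentences):
      """Average per-sentence perplexity as a rough fluency proxy (lower = more fluent)."""
      scores = [model.perplexity(s) for s in sentences]
      return sum(scores) / len(scores)

  sample = ["This is a fluent English sentence.", "sentence fluent this so not is"]
  print(mean_perplexity(sample))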

Here is the raw data:

COMET Teacher  COMET Student  Gap (Student - Teacher)  Parallel  Distillation  Backtranslation  Total Data
0.9013 0.8900 -0.0113 77,673,571 122,958,567 380,595,455 581,227,593
0.8946 0.8700 -0.0246 67,190,349 296,625,390 380,607,008 744,422,747
0.8934 0.8700 -0.0234 86,586,079 178,623,209 380,607,008 645,816,296
0.8817 0.8600 -0.0217 50,790,274 109,841,918 380,607,008 541,239,200
0.8791 0.8600 -0.0191 27,638,186 23,251,451 380,607,008 431,496,645
0.8979 0.8500 -0.0479 59,206,140 384,244,370 380,607,008 824,057,518
0.8765 0.8500 -0.0265 35,879,023 71,024,429 380,607,008 487,510,460
0.8763 0.8500 -0.0263 35,295,006 60,127,368 380,607,008 476,029,382
0.8757 0.8500 -0.0257 42,547,739 267,167,662 380,607,008 690,322,409
0.8759 0.8400 -0.0359 18,641,618 39,043,989 380,607,008 438,292,615
0.8667 0.8400 -0.0267 33,800,821 74,692,280 380,607,008 489,100,109
0.8871 0.8300 -0.0571 104,589,182 361,141,678 380,607,008 846,337,868
0.8705 0.8300 -0.0405 32,804,682 179,648,350 380,607,008 593,060,040
0.8652 0.8200 -0.0452 3,930,889 1,179,106 380,607,008 385,717,003
0.9041 0.8900 -0.0141 76,657,035 380,607,008 20,702,561 477,966,604
0.9054 0.8600 -0.0454 26,926,860 380,607,008 15,259,842 422,793,710
0.8932 0.8600 -0.0332 34,686,229 380,607,008 5,015,152 420,308,389
0.8863 0.8500 -0.0363 18,431,326 380,607,008 11,764,550 410,802,884
0.8971 0.8400 -0.0571 33,801,723 380,607,008 12,300,137 426,708,868
0.9070 0.8400 -0.0670 102,545,869 380,607,008 142,727,850 625,880,727
0.8944 0.8300 -0.0644 32,804,682 380,607,008 3,301,596 416,713,286