mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Investigate distillation quality gap #231

Open eu9ene opened 10 months ago

eu9ene commented 10 months ago

After training en-hu we noticed a somewhat larger quality gap, around 4 BLEU points, between the teacher and student models.

It’s 24.8 BLEU for the quantized and fine-tuned student vs 30.2 BLEU for the teacher ensemble on the flores-test dataset. For comparison, for en-nl we had 27.0 vs 28.3.

The training looks more or less normal, but a little less smooth than usual.

en-hu:

[screenshot of the en-hu training curves, 2023-10-24]

en-nl:

[screenshot of the en-nl training curves, 2023-10-24]

We had a larger and probably higher-quality dataset for en-nl.

We should investigate further whether it’s a pipeline issue, a config issue or a data issue.

eu9ene commented 1 month ago

The only idea I have at the moment is to try utilizing more monolingual data that the model hasn't seen. For example, we have a pretty small distillation gap for da-en. The training corpus was 161M sentences before filtering, and we added 139M monolingual sentences from more diverse sources compared to the regular news-crawl that we use for en-xx pairs:

  # The monolingual data contains:
  #   ~139,436,127 sentences
  mono-src:
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_1.txt.zst  # 65,099,327 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_2.txt.zst # 16,579,852 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-nllb/nllb-mono-da.txt.zst # 57,756,948 sentences

For en-lt we had 76M sentences in the original training corpus and the regular 200M sentences of English news-crawl:

  # The monolingual data contains:
  #   ~195,823,002 sentences
  mono-src:
  - news-crawl_news.2007  #           ~1,557,522 sentences (176M)
  - news-crawl_news.2008 #           ~5,389,380 sentences (609M)
  - news-crawl_news.2009 #           ~6,557,522 sentences (741M)
  - news-crawl_news.2010 #           ~3,247,787 sentences (367M)
  - news-crawl_news.2011 #           ~6,318,584 sentences (714M)
  - news-crawl_news.2012 #           ~6,407,079 sentences (724M)
  - news-crawl_news.2013 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2014 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2015 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2016 #           ~7,982,300 sentences (902M)
  - news-crawl_news.2017 #          ~11,504,424 sentences (1.3G)
  - news-crawl_news.2018 #           ~7,920,353 sentences (895M)
  - news-crawl_news.2019 #          ~17,699,115 sentences (2.0G)
  - news-crawl_news.2020 #          ~22,123,893 sentences (2.5G)
  - news-crawl_news.2021 #          ~21,238,938 sentences (2.4G)
  - news-crawl_news.2022 #          ~23,008,849 sentences (2.6G)
  - news-crawl_news.2023 #          ~23,008,849 sentences (2.6G)

So we could mine some monolingual data for English from HPLT and NLLB. Then we can increase and diversify the mono set used for distillation.

See the paper "From Research to Production and Back: Ludicrously Fast Neural Machine Translation":

2.1 Knowledge distillation with noisy backward-forward translation: "In our experience, student training benefits from forward-translated data that was not seen during teacher training. Since we do not have access to additional monolingual source data, we generate noisy back-translated sentences (Edunov et al., 2018), one set per inverse teacher model."
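
For illustration, here's a minimal sketch of the kind of sentence noise described in Edunov et al. (2018): word dropout, blanking with a filler token, and a small local shuffle. The probabilities and names below are assumptions for the example, not the paper's or our pipeline's actual settings.

  import random

  # Illustrative noise settings; these exact values are assumptions.
  DROP_PROB = 0.1       # probability of deleting a word
  BLANK_PROB = 0.1      # probability of replacing a word with a filler token
  SHUFFLE_DISTANCE = 3  # how far a word may move during the local shuffle

  def noise_sentence(sentence: str, rng: random.Random) -> str:
      words = sentence.split()
      kept = []
      for word in words:
          r = rng.random()
          if r < DROP_PROB:
              continue  # drop the word
          kept.append("<blank>" if r < DROP_PROB + BLANK_PROB else word)
      # Local shuffle: sort positions perturbed by a small random offset.
      offsets = [i + rng.uniform(0, SHUFFLE_DISTANCE) for i in range(len(kept))]
      shuffled = [word for _, word in sorted(zip(offsets, kept))]
      return " ".join(shuffled) if shuffled else sentence

  if __name__ == "__main__":
      rng = random.Random(0)
      print(noise_sentence("the quick brown fox jumps over the lazy dog", rng))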

@marco-c @gregtatum FYI

eu9ene commented 1 month ago

@gregtatum let's prepare NLLB and HPLT data for English in the same way we did for other languages. Then I'll rerun distillation with the more diverse dataset to check the hypothesis. The monolingual shards shouldn't be too big; ideally no bigger than 50M sentences, so that we can add as many as needed to get a good mix.
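
For illustration, a rough sketch of how such shards could be produced from a .txt.zst file. The 50M limit comes from this comment; the file names and the zstandard streaming approach are assumptions, not the actual importer code.

  import io
  import zstandard as zstd  # pip install zstandard

  # Per-shard limit suggested above.
  MAX_SENTENCES = 50_000_000

  def split_mono(input_path: str, output_prefix: str) -> None:
      """Split a .txt.zst monolingual file into plain-text shards of at most MAX_SENTENCES lines."""
      dctx = zstd.ZstdDecompressor()
      shard_idx, line_count = 1, 0
      out = open(f"{output_prefix}_{shard_idx}.txt", "w", encoding="utf-8")
      with open(input_path, "rb") as fh:
          reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
          for line in reader:
              if line_count >= MAX_SENTENCES:
                  out.close()
                  shard_idx, line_count = shard_idx + 1, 0
                  out = open(f"{output_prefix}_{shard_idx}.txt", "w", encoding="utf-8")
              out.write(line)
              line_count += 1
      out.close()

  # Example with a hypothetical file name:
  # split_mono("hplt_filtered_en.txt.zst", "hplt_filtered_en")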

eu9ene commented 1 month ago

Another hypothesis is that our on-the-fly data augmentation affects quality more for weaker teacher models. The experiment here would be to disable the OpusTrainer augmentations and see how the student performs. That's easier to check, but the fix would be very complex, as we would need to move the augmentations from training to the step that prepares the corpus for translation.
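
For reference, a minimal sketch of what moving the case-related augmentations to corpus preparation could look like: augment the source side before teacher decoding, so the student sees augmented source paired with the teacher's translation of it. The modifiers and probabilities are assumptions modeled loosely on OpusTrainer's UpperCase/TitleCase modifiers, not our actual config.

  import random

  # Illustrative probabilities; these values are assumptions, not our settings.
  UPPERCASE_PROB = 0.05
  TITLECASE_PROB = 0.05

  def augment_for_translation(lines, seed: int = 0):
      """Yield source sentences with case augmentations applied up front,
      before they are sent to the teacher for forward translation."""
      rng = random.Random(seed)
      for line in lines:
          r = rng.random()
          if r < UPPERCASE_PROB:
              yield line.upper()
          elif r < UPPERCASE_PROB + TITLECASE_PROB:
              yield line.title()
          else:
              yield line

  if __name__ == "__main__":
      sample = ["this is a test sentence.", "another sentence for the teacher."]
      print(list(augment_for_translation(sample)))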

gregtatum commented 1 month ago

I created 3 separate issues for different lines of investigation.

gregtatum commented 1 month ago

I did some light analysis of our recent runs, comparing their distillation gap against the sentence counts.

https://docs.google.com/spreadsheets/d/1l459Ui9J7ccdP6UMd1qDy51L8Uar2aZWbYWxOGcQqXA/edit?gid=1859623642#gid=1859623642

Data Source            Correlation
All monolingual data    0.331
Newscrawl              -0.215
HPLT                    0.295
NLLB                    0.385
HPLT+NLLB               0.421

The number of data points is pretty low, but it's an early signal that HPLT+NLLB contribute more to better distillation, and that more monolingual data is related to a smaller distillation gap.
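
Correlations like these can be computed with numpy's corrcoef (Pearson); the arrays below are placeholders, not the actual per-run numbers from the spreadsheet.

  import numpy as np

  # Placeholder values; the real per-run numbers are in the linked spreadsheet.
  distillation_gap = np.array([-0.011, -0.025, -0.023, -0.048, -0.057])
  hplt_nllb_sentences = np.array([139e6, 80e6, 95e6, 20e6, 10e6])

  # Pearson correlation between the gap and the HPLT+NLLB sentence count.
  corr = np.corrcoef(distillation_gap, hplt_nllb_sentences)[0, 1]
  print(f"correlation: {corr:.3f}")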

gregtatum commented 1 month ago

#790: Here's another idea on applying fluency scoring, similar to HPLT, to our translations.

gregtatum commented 2 weeks ago

Data Type         COMET Teacher  COMET Student
Parallel          0.782          0.433
Backtranslations  0.095
Distillation                     0.581
Total Data        0.645          0.647

I took another look at correlations on our most recent runs, excluding the English monolingual data from the analysis, since it was the same dataset across runs and so wouldn't show any correlation.

My interpretation here is that parallel data is still the best, followed by distillation data, while back-translation size only weakly affects COMET quality. I'd be curious whether back-translations would correlate more strongly with fluency scores than with overall translation quality.
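
For example, one rough way to get a fluency signal would be to score the back-translated side with an n-gram language model and look at average perplexity, similar in spirit to HPLT's fluency filtering. A sketch assuming a KenLM model is available (the model path is hypothetical):

  import kenlm  # pip install kenlm

  # Hypothetical model path; in practice this would be a language model for
  # the relevant side, e.g. the kind HPLT uses for fluency filtering.
  model = kenlm.Model("en.arpa.bin")

  def mean_perplexity(sentences):
      """Average per-sentence perplexity as a rough fluency proxy (lower = more fluent)."""
      scores = [model.perplexity(s) for s in sentences]
      return sum(scores) / len(scores)

  sample = ["This is a fluent English sentence.", "sentence fluent this so not is"]
  print(mean_perplexity(sample))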

Here is the raw data:

COMET Teacher  COMET Student  Gap (Student - Teacher)  Parallel  Distillation  Backtranslation  Total Data
0.9013 0.8900 -0.0113 77,673,571 122,958,567 380,595,455 581,227,593
0.8946 0.8700 -0.0246 67,190,349 296,625,390 380,607,008 744,422,747
0.8934 0.8700 -0.0234 86,586,079 178,623,209 380,607,008 645,816,296
0.8817 0.8600 -0.0217 50,790,274 109,841,918 380,607,008 541,239,200
0.8791 0.8600 -0.0191 27,638,186 23,251,451 380,607,008 431,496,645
0.8979 0.8500 -0.0479 59,206,140 384,244,370 380,607,008 824,057,518
0.8765 0.8500 -0.0265 35,879,023 71,024,429 380,607,008 487,510,460
0.8763 0.8500 -0.0263 35,295,006 60,127,368 380,607,008 476,029,382
0.8757 0.8500 -0.0257 42,547,739 267,167,662 380,607,008 690,322,409
0.8759 0.8400 -0.0359 18,641,618 39,043,989 380,607,008 438,292,615
0.8667 0.8400 -0.0267 33,800,821 74,692,280 380,607,008 489,100,109
0.8871 0.8300 -0.0571 104,589,182 361,141,678 380,607,008 846,337,868
0.8705 0.8300 -0.0405 32,804,682 179,648,350 380,607,008 593,060,040
0.8652 0.8200 -0.0452 3,930,889 1,179,106 380,607,008 385,717,003
0.9041 0.8900 -0.0141 76,657,035 380,607,008 20,702,561 477,966,604
0.9054 0.8600 -0.0454 26,926,860 380,607,008 15,259,842 422,793,710
0.8932 0.8600 -0.0332 34,686,229 380,607,008 5,015,152 420,308,389
0.8863 0.8500 -0.0363 18,431,326 380,607,008 11,764,550 410,802,884
0.8971 0.8400 -0.0571 33,801,723 380,607,008 12,300,137 426,708,868
0.9070 0.8400 -0.0670 102,545,869 380,607,008 142,727,850 625,880,727
0.8944 0.8300 -0.0644 32,804,682 380,607,008 3,301,596 416,713,286