mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Experiment with data cleaning #814

Closed · eu9ene closed this 2 months ago

eu9ene commented 2 months ago

Hypothesis

We can improve training quality by tuning filtering rules, using better cleaning models, etc.

Experiment insights

Spreadsheet with analysis for specific filtering rules.

Some insights for specific runs are in this spreadsheet.

OpusCleaner

OpusFilter:

Bicleaner AI

LASER

More questions to explore:

LASER embedding similarity filter (see the sketch after this list):

Bicleaner-AI:
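
The LASER filter mentioned above scores a sentence pair by the cosine similarity of the LASER embeddings of its two sides and drops pairs below a threshold. A minimal sketch of that idea, assuming the third-party `laserembeddings` package; the function name and the 0.8 threshold are illustrative, not the pipeline's actual code:

```python
# Sketch of a LASER-style embedding similarity filter (illustrative only).
# Requires: pip install laserembeddings
#           python -m laserembeddings download-models
import numpy as np
from laserembeddings import Laser

def filter_by_laser_similarity(src_lines, trg_lines, src_lang="en",
                               trg_lang="ru", threshold=0.8):
    """Keep only sentence pairs whose LASER embeddings are similar enough."""
    laser = Laser()
    src_emb = laser.embed_sentences(src_lines, lang=src_lang)
    trg_emb = laser.embed_sentences(trg_lines, lang=trg_lang)
    # Row-wise cosine similarity between corresponding source/target pairs.
    src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
    trg_emb /= np.linalg.norm(trg_emb, axis=1, keepdims=True)
    sims = (src_emb * trg_emb).sum(axis=1)
    return [(s, t) for s, t, sim in zip(src_lines, trg_lines, sims)
            if sim >= threshold]
```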

Setup

en-ru pair, all data except CCMatrix/NLLB, training a backward model (ru-en)

Example config:

```yaml
datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
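    # (illustrative, hypothetical values) instead of the empty mapping above,
    # per-dataset thresholds can be given to filter noisier corpora harder, e.g.:
    #   opus_OpenSubtitles/v2018: 0.8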
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards
```
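
The `bicleaner` block above applies Bicleaner AI with a default score threshold of 0.5. As a rough illustration of what that thresholding step amounts to (not the pipeline's actual implementation), assuming a TSV where `bicleaner-ai-classify` has appended the score as the last column:

```python
# Illustrative sketch: drop pairs whose Bicleaner AI score (assumed to be
# the last tab-separated column) falls below the threshold.
import sys

def filter_scored_tsv(in_path, out_path, threshold=0.5):
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            try:
                score = float(fields[-1])
            except ValueError:
                dropped += 1  # malformed or unscored line
                continue
            if score >= threshold:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept}, dropped {dropped}", file=sys.stderr)
```

Raising the threshold trades recall for precision: fewer but cleaner pairs survive, which the per-dataset overrides let you tune separately for noisier corpora.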