mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Experiment with data cleaning #814

Closed · eu9ene closed this 2 months ago

eu9ene commented 2 months ago

Hypothesis

We can improve training quality by tuning filtering rules, using better cleaning models, etc.

Experiment insights

Spreadsheet with analysis for specific filtering rules.

Some insights for specific runs are in this spreadsheet.

OpusCleaner

OpusFilter:

Bicleaner AI

LASER

More questions to explore:

LASER embedding similarity filter (see the sketch after this list):

Bicleaner-AI:
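
The LASER filter mentioned above scores a sentence pair by the cosine similarity of the LASER embeddings of its two sides and drops pairs below a threshold. A minimal sketch of that idea, assuming the third-party `laserembeddings` package; the function name and the 0.8 threshold are illustrative, not the pipeline's actual code:

```python
# Sketch of a LASER-style embedding similarity filter (illustrative only).
# Requires: pip install laserembeddings
#           python -m laserembeddings download-models
import numpy as np
from laserembeddings import Laser

def filter_by_laser_similarity(src_lines, trg_lines, src_lang="en",
                               trg_lang="ru", threshold=0.8):
    """Keep only sentence pairs whose LASER embeddings are similar enough."""
    laser = Laser()
    src_emb = laser.embed_sentences(src_lines, lang=src_lang)
    trg_emb = laser.embed_sentences(trg_lines, lang=trg_lang)
    # Row-wise cosine similarity between corresponding source/target pairs.
    src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
    trg_emb /= np.linalg.norm(trg_emb, axis=1, keepdims=True)
    sims = (src_emb * trg_emb).sum(axis=1)
    return [(s, t) for s, t, sim in zip(src_lines, trg_lines, sims)
            if sim >= threshold]
```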

Setup

en-ru pair, all data except CCMatrix/NLLB, training a backward model (ru-en)

Example config:

```yaml
datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
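    # (illustrative, hypothetical values) instead of the empty mapping above,
    # per-dataset thresholds can be given to filter noisier corpora harder, e.g.:
    #   opus_OpenSubtitles/v2018: 0.8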
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards
```
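
The `bicleaner` block above applies Bicleaner AI with a default score threshold of 0.5. As a rough illustration of what that thresholding step amounts to (not the pipeline's actual implementation), assuming a TSV where `bicleaner-ai-classify` has appended the score as the last column:

```python
# Illustrative sketch: drop pairs whose Bicleaner AI score (assumed to be
# the last tab-separated column) falls below the threshold.
import sys

def filter_scored_tsv(in_path, out_path, threshold=0.5):
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            try:
                score = float(fields[-1])
            except ValueError:
                dropped += 1  # malformed or unscored line
                continue
            if score >= threshold:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept}, dropped {dropped}", file=sys.stderr)
```

Raising the threshold trades recall for precision: fewer but cleaner pairs survive, which the per-dataset overrides let you tune separately for noisier corpora.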