Hypothesis
We can improve training quality by tuning filtering rules, using better cleaning models, etc.
Experiment insights
Spreadsheet with analysis for specific filtering rules.
Some insights for specific runs are in this spreadsheet.
OpusCleaner
legacy cleaning slightly outperforms all OpusCleaner configs (likely due to the num_mismatch filter in OpusCleaner)
the large FastText language-ID model significantly reduces false positives compared to the small one (see the sketch after this list)
FastText can remove a lot of useful data on cleaner datasets, especially short phrases
the alpha ratio filter can also remove useful data on cleaner datasets
custom OpusCleaner configs slightly outperform the default one
custom OpusCleaner configs + Bicleaner significantly outperform the default one + Bicleaner (+5M useful sentences gained by removing some cleaning rules)
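A minimal sketch of the small-vs-large comparison, assuming the standard published fasttext language-ID models lid.176.ftz (small, compressed) and lid.176.bin (large); the short phrases are illustrative, not the exact ones from the experiments:

# Compare the small (compressed) and large fasttext language-ID models on
# short phrases, where the small model tends to produce false positives.
# Assumes both models were downloaded from the fasttext site beforehand.
import fasttext

small = fasttext.load_model("lid.176.ftz")
large = fasttext.load_model("lid.176.bin")

phrases = ["OK", "Privacy Policy", "Да", "Read more"]  # illustrative examples

for phrase in phrases:
    (s_label,), (s_prob,) = small.predict(phrase)
    (l_label,), (l_prob,) = large.predict(phrase)
    print(f"{phrase!r}: small={s_label} ({s_prob:.2f}) large={l_label} ({l_prob:.2f})")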
OpusFilter:
an OpusFilter config similar to the OpusCleaner one, with auto-tuning, performs a lot worse than OpusCleaner (likely due to the difference in filters)
OpusFilter with LASER and auto-tuning performs better than without it, but still worse than OpusCleaner (Helsinki folks pointed out that there's a bug in sampling with LASER)
Auto-tuning with only basic OpusCleaner-like filters (no Bicleaner or LASER) performs better than the OpusCleaner-like defaults and better than auto-tuning with feature selection disabled, mostly because it trained longer and had more data
Auto-tuning with LASER and Bicleaner AI enabled filters out way too much data and underperforms
Neither the auto-tuned nor the defaults-based OpusCleaner-like rules outperform the OpusCleaner defaults baseline (likely due to a difference in the FastText implementation)
(TODO) tune LASER and Bicleaner separately
Bicleaner AI
I deployed OpusCleaner on a GPU with Bicleaner AI support; it's a little slow but works
it's very hard to tune Bicleaner thresholds in OpusCleaner
Manual analysis of score distributions and examples in Jupyter shows that even with a 0.9 threshold there are plenty of incorrect translations (see the sketch after this list)
Experimented with 0.5 vs 0.8 vs 0.9 thresholds for all datasets: 0.8 slightly outperforms 0.5; 0.9 filters out too much data but is still competitive
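A minimal sketch of the kind of Jupyter analysis done here, assuming a scored file where bicleaner-ai-classify appended the score as the last tab-separated column (the file name corpus.scored.tsv is an assumption):

# Inspect the Bicleaner AI score distribution and check how much data each
# candidate threshold would keep. Assumes corpus.scored.tsv is line-aligned
# with the corpus and carries the score in the last tab-separated column.
scores = []
with open("corpus.scored.tsv", encoding="utf-8") as f:
    for line in f:
        scores.append(float(line.rstrip("\n").split("\t")[-1]))

total = len(scores)
for threshold in (0.5, 0.8, 0.9):
    kept = sum(s >= threshold for s in scores)
    print(f"threshold {threshold}: keeps {kept}/{total} ({kept / total:.1%})")

# A histogram makes bimodal score distributions and long tails easy to spot.
import matplotlib.pyplot as plt
plt.hist(scores, bins=50)
plt.xlabel("Bicleaner AI score")
plt.ylabel("sentence pairs")
plt.show()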
LASER
also hard to tune in OpusCleaner
LASER 2/3 is slower than LASER 1 and requires a GPU
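For reference, a minimal sketch of what a LASER-style similarity filter computes, using the laserembeddings package (a LASER 1 implementation); the example pairs and the 0.85 cutoff are assumptions for illustration only:

# Score sentence pairs by cosine similarity of their LASER embeddings and
# drop pairs below a cutoff. laserembeddings wraps the original LASER 1
# models (run `python -m laserembeddings download-models` first).
import numpy as np
from laserembeddings import Laser

laser = Laser()
src = ["The weather is nice today.", "Click here to subscribe."]
trg = ["Сегодня хорошая погода.", "Наши цены вас приятно удивят."]

src_emb = laser.embed_sentences(src, lang="en")
trg_emb = laser.embed_sentences(trg, lang="ru")

# Cosine similarity between corresponding rows.
sims = np.sum(src_emb * trg_emb, axis=1) / (
    np.linalg.norm(src_emb, axis=1) * np.linalg.norm(trg_emb, axis=1)
)

CUTOFF = 0.85  # assumed threshold; would need tuning per dataset
for s, t, sim in zip(src, trg, sims):
    print(f"{sim:.3f} {'keep' if sim >= CUTOFF else 'drop'}: {s} ||| {t}")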
More questions to explore:
LASER embedding similarity filter:
What's the impact of the LASER filter?
Can LASER be useful together with Bicleaner-AI? (one way to combine them is sketched after this list)
Does LASER 2/3 significantly outperform LASER 1?
Bicleaner-AI:
Will customizing the thresholds for large datasets boost performance?
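One straightforward way the LASER + Bicleaner-AI combination could work, sketched under assumptions: both tools have already written one score per line for the same corpus (laser.scores and bicleaner.scores are hypothetical file names), and a pair is kept only if it passes both thresholds:

# Keep a sentence pair only if it passes both the LASER similarity cutoff
# and the Bicleaner AI cutoff. Assumes three line-aligned files: the corpus
# TSV and one score file per tool (all names are hypothetical).
LASER_CUTOFF = 0.85      # assumed
BICLEANER_CUTOFF = 0.8   # assumed

with open("corpus.tsv", encoding="utf-8") as corpus, \
     open("laser.scores", encoding="utf-8") as laser_f, \
     open("bicleaner.scores", encoding="utf-8") as bic_f, \
     open("corpus.filtered.tsv", "w", encoding="utf-8") as out:
    for pair, laser_s, bic_s in zip(corpus, laser_f, bic_f):
        if float(laser_s) >= LASER_CUTOFF and float(bic_s) >= BICLEANER_CUTOFF:
            out.write(pair)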
Setup
en-ru pair, all data except CCMatrix/NLLB, training backward model (ru-en)
Example config:
datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards