hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

KeyError: '٦' with typos modifier #40

Closed eu9ene closed 10 months ago

eu9ene commented 10 months ago

This happens while training a backward model for lt-en. If I remove the Typos modifier, the problem goes away.

[task 2023-10-30T16:49:28.112Z] [2023-10-30 16:49:28] [memory] Reserving 95 MB, device gpu1
[task 2023-10-30T16:51:44.995Z] Traceback (most recent call last):
[task 2023-10-30T16:51:44.995Z]   File "/home/ubuntu/.local/bin/opustrainer-train", line 8, in <module>
[task 2023-10-30T16:51:44.995Z]     sys.exit(main())
[task 2023-10-30T16:51:44.995Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 858, in main
[task 2023-10-30T16:51:44.995Z]     for batch in state_tracker.run(trainer, batch_size=args.batch_size, chunk_size=args.chunk_size, processes=args.workers):
[task 2023-10-30T16:51:44.995Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 779, in run
[task 2023-10-30T16:51:45.003Z]     for batch in trainer.run(*args, **kwargs):
[task 2023-10-30T16:51:45.003Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/trainer.py", line 721, in run
[task 2023-10-30T16:51:45.003Z]     batch = pool.map(batch, chunk_size)
[task 2023-10-30T16:51:45.003Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/opustrainer/modifiers/pool.py", line 136, in map
[task 2023-10-30T16:51:45.003Z]     raise exc
[task 2023-10-30T16:51:45.003Z] KeyError: '٦'

Failed task: https://firefox-ci-tc.services.mozilla.com/tasks/HBOXDKykSoGwm6iHcKN5jw
Training log: https://firefoxci.taskcluster-artifacts.net/HBOXDKykSoGwm6iHcKN5jw/0/public/logs/live_backing.log
Training corpus:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/C_CXBDwhScC8Gzy7J9iJhw/runs/0/artifacts/public%2Fbuild%2Fcorpus.en.zst
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/C_CXBDwhScC8Gzy7J9iJhw/runs/0/artifacts/public%2Fbuild%2Fcorpus.lt.zst

Opus trainer config:

datasets:
  original: /home/ubuntu/tasks/task_169868422726590/fetches/corpus.enlt.tsv # Original parallel corpus

stages:
  - train

train:
  - original 1.0
  - until original inf # General training until marian early stops

modifiers:
- UpperCase: 0.05 # Apply randomly to 5% of sentences
- TitleCase: 0.05
- Typos: 0.05

seed: 1111
num_fields: 2
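
Since the exception names a single character, one way to narrow this down is to look for corpus lines that contain decimal digits outside ASCII. The following is a minimal sketch under that assumption; the helper name `find_non_ascii_digits` and the sample lines are made up for illustration and are not part of OpusTrainer.

```python
from typing import Iterable, Iterator, Tuple


def find_non_ascii_digits(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line number, offending characters) for lines with non-ASCII digits."""
    for lineno, line in enumerate(lines, start=1):
        # str.isdigit() is true for any Unicode digit, so also require non-ASCII.
        hits = {ch for ch in line if ch.isdigit() and not ch.isascii()}
        if hits:
            yield lineno, "".join(sorted(hits))


# Made-up sample lines in the two-column TSV layout (num_fields: 2).
sample = [
    "Hello 6\tLabas 6\n",
    "Room \u0666\tKambarys \u0666\n",  # contains U+0666 ARABIC-INDIC DIGIT SIX
]

print(list(find_non_ascii_digits(sample)))
```

To run it against a real file, pass an open file object instead of `sample`, for example `open(path, encoding="utf-8")` where `path` points at the TSV corpus.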

Training config:


datasets:
  # parallel training corpus
  train:
    - opus_CCAligned/v1
    - opus_DGT/v2019
    - opus_ECB/v1
    - opus_ECDC/v2016-03-16
    - opus_ELITR-ECA/v1
    - opus_ELRA-W0160/v1
    - opus_ELRC-2021-EUIPO_2017/v1
    - opus_ELRC-2717-EMEA/v1
    - opus_ELRC-2740-vaccination/v1
    - opus_ELRC-2878-EU_publications_medi/v1
    - opus_ELRC-3205-antibiotic/v1
    - opus_ELRC-3296-EUROPARL_covid/v1
    - opus_ELRC-3467-EC_EUROPA_covid/v1
    - opus_ELRC-3568-EUR_LEX_covid/v1
    - opus_ELRC-3609-presscorner_covid/v1
    - opus_ELRC-405-President_Lithuania/v1
    - opus_ELRC-425-Lithuanian_legislati/v1
    - opus_ELRC-4270-NTEU_TierA/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-590-www.lrs.lt/v1
    - opus_ELRC-591-www.lb.lt/v1
    - opus_ELRC-592-kam.lt/v1
    - opus_ELRC-EC_EUROPA/v1
    - opus_ELRC-EMEA/v1
    - opus_ELRC-EUIPO_2017/v1
    - opus_ELRC-EUROPARL_covid/v1
    - opus_ELRC-EUR_LEX/v1
    - opus_ELRC-EU_publications/v1
    - opus_ELRC-antibiotic/v1
    - opus_ELRC-presscorner_covid/v1
    - opus_ELRC-vaccination/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_ELRC_2923/v1
    - opus_ELRC_3382/v1
    - opus_EMEA/v3
    - opus_EUbookshop/v2
    - opus_EUconst/v1
    - opus_Europarl/v8
    - opus_GNOME/v1
    - opus_JRC-Acquis/v3.0
    - opus_KDE4/v2
    - opus_NLLB/v1
    - opus_NeuLab-TedTalks/v1
    - opus_OpenSubtitles/v2018
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2020/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_XLEnt/v1.2
    - opus_bible-uedin/v1
    - opus_wikimedia/v20230407
    - mtdata_EU-dcep-1-eng-lit
    - mtdata_EU-eac_forms-1-eng-lit
    - mtdata_EU-eac_reference-1-eng-lit
    - mtdata_EU-ecdc-1-eng-lit
    - mtdata_Statmt-wiki_titles-1-lit-eng
  # datasets to merge for validation while training
  devtest:
    - flores_aug-mix_dev
    - sacrebleu_aug-mix_wmt19/dev
    - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-lit
  # datasets for evaluation
  test:
    - flores_devtest
    - flores_aug-mix_devtest
    - flores_aug-title_devtest
    - flores_aug-title-strict_devtest
    - flores_aug-upper_devtest
    - flores_aug-upper-strict_devtest
    - flores_aug-typos_devtest
    - sacrebleu_wmt19
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-title_wmt19
    - sacrebleu_aug-title-strict_wmt19
    - sacrebleu_aug-upper_wmt19
    - sacrebleu_aug-upper-strict_wmt19
    - sacrebleu_aug-typos_wmt19
    - mtdata_Neulab-tedtalks_test-1-eng-lit
    - mtdata_aug-mix_Neulab-tedtalks_test-1-eng-lit
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2022
    - news-crawl_news.2021
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2022
    - news-crawl_news.2021
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
    - news-crawl_news.2017
    - news-crawl_news.2016
    - news-crawl_news.2015
    - news-crawl_news.2014
    - news-crawl_news.2013
    - news-crawl_news.2012
    - news-crawl_news.2011
    - news-crawl_news.2010
    - news-crawl_news.2009
    - news-crawl_news.2008
    - news-crawl_news.2007
experiment:
  src: lt
  trg: en
  name: opustrainer
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds:
      opus_CCAligned/v1: 0.7
      opus_OpenSubtitles/v2018: 0.8
      opus_ParaCrawl/v9: 0
      opus_WikiMatrix/v1: 0.7
      mtdata_Statmt-wiki_titles-1-lit-eng: 0.7
      opus_bible-uedin/v1: 0.7
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 2
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'false'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher-base:
    after: 2e
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
  training-teacher-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: all

XapaJIaMnu commented 10 months ago

@jelmervdl

jelmervdl commented 10 months ago

Traceback (most recent call last):
  File "/Users/jelmer/Workspace/statmt/empty-trainer/tests/test_typos.py", line 95, in test_regression_40
    self.assertNotEqual(next(iter(modifier([line]))), line)
  File "/Users/jelmer/Workspace/statmt/empty-trainer/src/opustrainer/modifiers/typos.py", line 200, in __call__
    yield self.apply(line)
  File "/Users/jelmer/Workspace/statmt/empty-trainer/src/opustrainer/modifiers/typos.py", line 230, in apply
    getattr(data, modifier)()
  File "/Users/jelmer/.virtualenvs/opustrainer/lib/python3.8/site-packages/typo/Errer.py", line 69, in extra_char
    char_to_add = en_default.get_random_neighbor(trigger_char)
  File "/Users/jelmer/.virtualenvs/opustrainer/lib/python3.8/site-packages/typo/keyboardlayouts/en_default.py", line 118, in get_random_neighbor
    return random.choice(NEIGHBORINGNUMPADDIGITS[char])
KeyError: '٦'
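
The last frame indexes `NEIGHBORINGNUMPADDIGITS` with '٦' (U+0666, ARABIC-INDIC DIGIT SIX). One plausible explanation, not confirmed in this thread, is that the character reaches the numpad table through a check such as `str.isdigit()`, which is true for every Unicode decimal digit, while the table only has ASCII keys. Below is a minimal Python sketch of that failure mode and one possible guard; `NUMPAD_NEIGHBORS` and both functions are hypothetical stand-ins, not the `typo` package's actual code.

```python
import random

# Hypothetical stand-in for a numpad neighbour table keyed by ASCII digits only.
NUMPAD_NEIGHBORS = {
    "4": ["1", "5", "7"],
    "5": ["2", "4", "6", "8"],
    "6": ["3", "5", "9"],
}


def get_random_neighbor(char: str) -> str:
    """Unguarded lookup: raises KeyError for any key missing from the table."""
    return random.choice(NUMPAD_NEIGHBORS[char])


def get_random_neighbor_safe(char: str) -> str:
    """Hypothetical guard: return the character unchanged if it has no entry."""
    neighbors = NUMPAD_NEIGHBORS.get(char)
    return random.choice(neighbors) if neighbors else char


arabic_six = "\u0666"  # '٦'

# The character counts as a digit for Python, but it is not an ASCII digit.
print(arabic_six.isdigit(), arabic_six.isascii())

try:
    get_random_neighbor(arabic_six)
except KeyError as exc:
    print("KeyError:", exc)

# The guarded variant leaves the unsupported character as it is.
print(get_random_neighbor_safe(arabic_six) == arabic_six)
```

Returning the character unchanged is only one option; a modifier could also catch the exception and skip the typo for that line, which keeps the training data flowing without altering the upstream table.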