Support monolingual data from OPUS

bhearsum commented 11 months ago

@gregtatum has been trying to use an OPUS dataset as one of the mono datasets. For example, with this training config:

datasets:
  devtest:
    - flores_devtest
  mono-src:
    - opus_MaCoCu/v2
  mono-trg:
    - news-crawl_news.2021
    - news-crawl_news.2020
    - news-crawl_news.2019
    - news-crawl_news.2018
    - news-crawl_news.2017
    - news-crawl_news.2016
    - news-crawl_news.2015
    - news-crawl_news.2014
    - news-crawl_news.2013
    - news-crawl_news.2012
    - news-crawl_news.2011
    - news-crawl_news.2010
    - news-crawl_news.2009
    - news-crawl_news.2008
    - news-crawl_news.2007
  test:
    - flores_dev
  train:
    - opus_NLLB/v1
    - opus_OpenSubtitles/v2018
experiment:
  backward-model: NOT-YET-SUPPORTED
  best-model: chrf
  bicleaner:
    dataset-thresholds: {}
    default-threshold: 0.1
  mono-max-sentences-src: 100000000
  mono-max-sentences-trg: 20000000
  name: baseline_en_ca
  split-length: 2000000
  spm-sample-size: 10000000
  src: en
  teacher-ensemble: 2
  trg: ca
  use-opuscleaner: 'true'
  vocab: NOT-YET-SUPPORTED
marian-args:
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  decoding-teacher:
    mini-batch-words: '4000'
    precision: float16
  training-backward:
    early-stopping: '5'
  training-student:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
  training-teacher:
    early-stopping: '30'
target-stage: all
taskcluster:
  split-chunks: 10

Note the opus_MaCoCu/v2 in mono-src.

When run, we end up with:

Traceback (most recent call last):
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/main.py", line 900, in main
    return args.command(vars(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/main.py", line 430, in show_taskgraph
    ret = generate_taskgraph(options, parameters, logdir)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/main.py", line 192, in generate_taskgraph
    out = format_taskgraph(options, spec, logfile(spec))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/main.py", line 148, in format_taskgraph
    tg = getattr(tgg, options["graph_attr"])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/generator.py", line 188, in target_task_graph
    return self._run_until("target_task_graph")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/generator.py", line 425, in _run_until
    k, v = next(self._run)
           ^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/generator.py", line 311, in _run
    new_tasks = kind.load_tasks(
                ^^^^^^^^^^^^^^^^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in load_tasks
    tasks = [
            ^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in <listcomp>
    tasks = [
            ^
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1339, in check_run_task_caches
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1268, in check_task_dependencies
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1254, in check_task_identifiers
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1235, in chain_of_trust
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1228, in add_github_checks
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1057, in build_task
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1025, in add_index_routes
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 953, in process_treeherder_metadata
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 915, in validate
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 903, in task_name_from_label
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 859, in set_defaults
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 838, in set_implementation
    for task in tasks:
  File "/home/bhearsum/repos/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 106, in cache_task
    for task in order_tasks(config, tasks):
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/cached_tasks.py", line 22, in order_tasks
    pending = deque(tasks)
              ^^^^^^^^^^^^
  File "/home/bhearsum/repos/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 58, in add_cache
    for job in jobs:
  File "/home/bhearsum/repos/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 44, in resolved_keyed_by_fields
    for job in jobs:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/job/__init__.py", line 363, in make_task_description
    for job in jobs:
  File "/home/bhearsum/.pyenv/versions/3.11.6/envs/taskgraph/lib/python3.11/site-packages/taskgraph/transforms/job/__init__.py", line 293, in use_fetches
    raise Exception(
Exception: clean-mono-opus-en-MaCoCu_v2-mono-src can't fetch opus-en artifacts because there are no tasks with label dataset-opus-MaCoCu_v2-en in kind dependencies!

This happens because the dataset kind names OPUS datasets with both the src and trg locales (en-ca here), while the clean-mono kind looks for a task named with just src or trg.
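To make the mismatch concrete, here is a minimal sketch of the two labels involved. The `task_label` helper is hypothetical (it is not a function in the repo), and the label format is only inferred from the error message above; the real naming lives in the taskgraph kinds and transforms.

```python
def task_label(kind_prefix: str, provider: str, dataset: str, locale: str) -> str:
    """Hypothetical helper: builds a label shaped like the one in the error message."""
    # Assumption: "/" in the dataset name becomes "_" (MaCoCu/v2 -> MaCoCu_v2),
    # as seen in the label reported by the exception.
    return f"{kind_prefix}-{provider}-{dataset.replace('/', '_')}-{locale}"


src, trg = "en", "ca"
provider, dataset = "opus_MaCoCu/v2".split("_", 1)

# What the dataset kind is described as producing for OPUS: both locales in the name.
produced = task_label("dataset", provider, dataset, f"{src}-{trg}")

# What clean-mono asks for when the dataset is listed under mono-src: src only.
wanted = task_label("dataset", provider, dataset, src)

print(produced)
print(wanted)
print(produced == wanted)
```

Since the two strings differ, the fetch lookup for the clean-mono task finds no matching dependency, which is the exception shown in the traceback.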

The obvious thing to do is to always name all datasets with both the src and trg locales...but the fact that some of our datasets are monolingual makes this a non-starter (I think). (I gave this a quick try and ended up with some complaints about news-crawl instead, at least...)

My horrible hack to unstick this in the short term was:

diff --git a/taskcluster/translations_taskgraph/transforms/from_datasets.py b/taskcluster/translations_taskgraph/transforms/from_datasets.py
index edd277b..23795a1 100644
--- a/taskcluster/translations_taskgraph/transforms/from_datasets.py
+++ b/taskcluster/translations_taskgraph/transforms/from_datasets.py
@@ -121,19 +121,17 @@ def jobs_for_mono_datasets(config, jobs):

         for full_dataset in included_datasets:
             dataset_provider, dataset = full_dataset.split("_", 1)
             if provider and provider != dataset_provider:
                 continue

             subjob = copy.deepcopy(job)

-            if dataset_provider == "opus":
-                locale = f"{src}-{trg}"
-            elif category == "mono-src":
+            if category == "mono-src":
                 locale = src
             elif category == "mono-trg":
                 locale = trg
             else:
                 raise Exception(
                     "from_datasets:mono can only be used with mono-src and mono-trg categories"
                 )

(This is not remotely landable or good - I'm mainly putting it here for future reference.)
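For reference, the locale selection that remains after the hack can be read in isolation as the sketch below. The function name `mono_locale` is made up for illustration; in the repo this logic sits inline in `jobs_for_mono_datasets`.

```python
def mono_locale(category: str, src: str, trg: str) -> str:
    """Pick the single locale for a mono dataset, ignoring the provider entirely."""
    if category == "mono-src":
        return src
    if category == "mono-trg":
        return trg
    # Same error as in the transform: only the two mono categories are valid here.
    raise Exception(
        "from_datasets:mono can only be used with mono-src and mono-trg categories"
    )


print(mono_locale("mono-src", "en", "ca"))
print(mono_locale("mono-trg", "en", "ca"))
```

With the `opus` branch removed, an OPUS dataset under mono-src gets the `en` locale like any other provider, which is what makes the labels line up again.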

eu9ene commented 10 months ago

It's not really a bug. We don't support monolingual datasets from OPUS. I think they started adding them recently. Also, we have plenty of English data for back-translation from news-crawl, and for other languages, if the data is on OPUS as parallel data, we'll want to use it as train and not mono.

gregtatum commented 6 months ago

Looking at this again, the NLLB dataset can have language pairs that don't include English. For instance, for Catalan there are 21M sentences for en-ca, but 65M for es-ca. I could see using the Catalan side of the es-ca pair as monolingual data. I don't see anywhere that someone has built a monolingual dataset from NLLB.
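As a rough illustration of that idea, the sketch below pulls one side out of a parallel corpus and deduplicates it. Everything here is an assumption for the sake of the example: the `mono_side` helper, the in-memory list of pairs, and the sample sentences. It is not how the pipeline imports data.

```python
from typing import Iterable, Iterator, Optional, Tuple


def mono_side(
    pairs: Iterable[Tuple[str, str]],
    side: int,
    max_sentences: Optional[int] = None,
) -> Iterator[str]:
    """Yield unique, non-empty sentences from one side (0 or 1) of a parallel corpus."""
    seen = set()
    for pair in pairs:
        sentence = pair[side].strip()
        # Skip blanks and sentences already emitted; parallel corpora often
        # repeat the same sentence on one side.
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        yield sentence
        if max_sentences is not None and len(seen) >= max_sentences:
            return


# Toy es-ca pairs standing in for a real corpus.
es_ca = [
    ("Buenos días", "Bon dia"),
    ("Gracias", "Gràcies"),
    ("Buenos días a todos", "Bon dia"),
    ("", ""),
]

catalan = list(mono_side(es_ca, side=1))
print(catalan)
```

The in-memory `seen` set is fine for a toy example, but at the scale mentioned above (tens of millions of sentences) deduplication would likely need a more memory-conscious approach.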

mozilla / translations

Support monolingual data from OPUS #286