mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0
135 stars 28 forks source link

ERROR - `task.dependencies` references tasks that expires before `task.deadline` #691

Closed eu9ene closed 1 week ago

eu9ene commented 1 week ago

It seems the issue with the deadline is even more serious. My train action with caching fails now:

https://firefox-ci-tc.services.mozilla.com/tasks/DQkkrHmNQP6dSjg5p_YYfQ/runs/0

I'm running it from this PR push task: https://github.com/mozilla/firefox-translations-training/pull/690 With this config, which is a regular way to use the cached tasks:

# The initial configuration was generated using:
# task config-generator -- en uk --name spring-2024
#
# The documentation for this config can be found here:
# https://github.com/mozilla/firefox-translations-training/blob/001d15d5f2775e0e4c57717057a1906069e29dcc/taskcluster/configs/config.prod.yml
experiment:
  name: spring-2024
  src: en
  trg: uk
  best-model: chrf
  use-opuscleaner: 'true'
  opuscleaner-mode: defaults
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  mono-max-sentences-src: 500_000_000
  mono-max-sentences-trg: 200_000_000
  spm-sample-size: 10_000_000
  spm-vocab-size: 32000
  teacher-ensemble: 2
  teacher-mode: two-stage
  pretrained-models: {}
datasets:

  # Skipped test/devtest datasets:
  devtest:
  - flores_aug-mix_dev
  - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-ukr
  test:
  - flores_devtest
  - flores_aug-mix_devtest
  - flores_aug-title_devtest
  - flores_aug-upper_devtest
  - flores_aug-typos_devtest
  - flores_aug-noise_devtest
  - flores_aug-inline-noise_devtest
  - mtdata_Neulab-tedtalks_test-1-eng-ukr

  # The training data contains:
  #   58,968,083 sentences
  #
  # Skipped datasets:
  #  - opus_CCMatrix/v1 - ignored datasets (20,240,171 sentences)
  #  - opus_MultiMaCoCu/v2 - ignored datasets (6,406,288 sentences)
  #  - opus_GNOME/v1 - not enough data  (150 sentences)
  #  - opus_Ubuntu/v14.10 - not enough data  (0 sentences)
  #  - mtdata_ELRC-wikipedia_health-1-eng-ukr - duplicate with opus
  #  - mtdata_Facebook-wikimatrix-1-eng-ukr - duplicate with opus
  #  - mtdata_Neulab-tedtalks_train-1-eng-ukr - duplicate with opus
  #  - mtdata_Statmt-ccaligned-1-eng-ukr_UA - duplicate with opus
  train:
  - opus_NLLB/v1  #                                       20,240,171 sentences
  - opus_ParaCrawl/v9 #                                  14,079,832 sentences
  - opus_CCAligned/v1 #                                   8,547,377 sentences
  - opus_MaCoCu/v2 #                                      6,406,294 sentences
  - opus_XLEnt/v1.2 #                                     3,671,061 sentences
  - opus_SUMMA/v1 #                                       1,574,611 sentences
  - opus_OpenSubtitles/v2018 #                              877,780 sentences
  - opus_wikimedia/v20230407 #                              757,910 sentences
  - opus_WikiMatrix/v1 #                                    681,115 sentences
  - opus_ELRC-5214-A_Lexicon_Named/v1 #                     495,403 sentences
  - opus_ELRC-5183-SciPar_Ukraine/v1 #                      306,813 sentences
  - opus_KDE4/v2 #                                          233,611 sentences
  - opus_QED/v2.0a #                                        215,630 sentences
  - opus_TED2020/v1 #                                       208,141 sentences
  - opus_Tatoeba/v2023-04-12 #                              175,502 sentences
  - opus_ELRC-5179-acts_Ukrainian/v1 #                      129,942 sentences
  - opus_ELRC-5180-Official_Parliament_/v1 #                116,260 sentences
  - opus_NeuLab-TedTalks/v1 #                               115,474 sentences
  - opus_ELRC-5181-Official_Parliament_/v1 #                 61,012 sentences
  - opus_ELRC-5174-French_Polish_Ukrain/v1 #                 36,228 sentences
  - opus_bible-uedin/v1 #                                    15,901 sentences
  - opus_ELRC-5182-Official_Parliament_/v1 #                  8,800 sentences
  - opus_ELRC-3043-wikipedia_health/v1 #                      2,735 sentences
  - opus_ELRC-wikipedia_health/v1 #                           2,735 sentences
  - opus_ELRC_2922/v1 #                                       2,734 sentences
  - opus_EUbookshop/v2 #                                      1,793 sentences
  - opus_TildeMODEL/v2018 #                                   1,628 sentences
  - opus_ELRC-5217-Ukrainian_Legal_MT/v1 #                      997 sentences
  - opus_tldr-pages/v2023-08-29 #                               593 sentences
  - mtdata_Tilde-worldbank-1-eng-ukr #                      ~2,011 sentences (227.3 kB)

  # The monolingual data contains:
  #   ~209,074,237 sentences
  mono-src:
  - news-crawl_news.2007  #          ~1,630,834 sentences (184.3 MB)
  - news-crawl_news.2008 #          ~5,648,654 sentences (638.3 MB)
  - news-crawl_news.2009 #          ~6,879,522 sentences (777.4 MB)
  - news-crawl_news.2010 #          ~3,406,380 sentences (384.9 MB)
  - news-crawl_news.2011 #          ~6,628,308 sentences (749.0 MB)
  - news-crawl_news.2012 #          ~6,715,609 sentences (758.9 MB)
  - news-crawl_news.2013 #         ~11,050,614 sentences (1.2 GB)
  - news-crawl_news.2014 #         ~11,026,051 sentences (1.2 GB)
  - news-crawl_news.2015 #         ~11,182,484 sentences (1.3 GB)
  - news-crawl_news.2016 #          ~8,366,518 sentences (945.4 MB)
  - news-crawl_news.2017 #         ~12,276,499 sentences (1.4 GB)
  - news-crawl_news.2018 #          ~8,303,432 sentences (938.3 MB)
  - news-crawl_news.2019 #         ~19,386,668 sentences (2.2 GB)
  - news-crawl_news.2020 #         ~24,070,652 sentences (2.7 GB)
  - news-crawl_news.2021 #         ~23,139,914 sentences (2.6 GB)
  - news-crawl_news.2022 #         ~24,900,055 sentences (2.8 GB)
  - news-crawl_news.2023 #         ~24,462,043 sentences (2.8 GB)

  # The monolingual data contains:
  #   ~1,940,719 sentences
  mono-trg:
  - news-crawl_news.2008  #              ~6,213 sentences (702.1 kB)
  - news-crawl_news.2009 #             ~31,947 sentences (3.6 MB)
  - news-crawl_news.2010 #              ~6,663 sentences (753.0 kB)
  - news-crawl_news.2011 #             ~61,690 sentences (7.0 MB)
  - news-crawl_news.2012 #             ~71,172 sentences (8.0 MB)
  - news-crawl_news.2013 #             ~86,086 sentences (9.7 MB)
  - news-crawl_news.2014 #             ~92,031 sentences (10.4 MB)
  - news-crawl_news.2015 #            ~106,485 sentences (12.0 MB)
  - news-crawl_news.2016 #             ~41,342 sentences (4.7 MB)
  - news-crawl_news.2018 #             ~88,962 sentences (10.1 MB)
  - news-crawl_news.2019 #            ~203,060 sentences (22.9 MB)
  - news-crawl_news.2020 #            ~221,669 sentences (25.0 MB)
  - news-crawl_news.2021 #            ~220,114 sentences (24.9 MB)
  - news-crawl_news.2022 #            ~247,965 sentences (28.0 MB)
  - news-crawl_news.2023 #            ~455,320 sentences (51.5 MB)
marian-args:
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  decoding-teacher:
    mini-batch-words: '4000'
    precision: float16
  training-backward:
    early-stopping: '5'
  training-teacher:
    early-stopping: '20'
  training-student:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
target-stage: all
start-stage: alignments-student
previous_group_ids: ["a5F1swqWQhKp88A4aiJ3wg"]
wandb-publication: true
taskcluster:
  split-chunks: 20
  worker-classes:
    default: gcp-spot
eu9ene commented 1 week ago

Copying discussion from Slack related to the sudden rebuilding of the toolchains.

@bhearsum:

We have a guard in taskgraph that will not use a cached task if that task expires before the deadline of a task that depends on it. This is to ensure that any tasks upstream of the ones we're about to create will be available at all possible times the tasks could run. In this case, this task is the one we should've re-used, whose expiry is 2024-08-14T11:28:43.692Z. One of the tasks that wanted it ended up being created with a deadline of 2024-08-18T18:26:25.727Z - 4 days after the cached task expires. For the short term, I suggest you reduce the default task deadline such that it will be before 2024-08-14. If you do that and try again with the same training config, I expect you'll pick up the tasks you expect. For the medium term, we should lengthen the expiry time of toolchain, and perhaps other tasks. And it may be a good idea to force rebuilds at the start of big trainings to make sure we don't depend on any cached tasks that may be expiring soon.

eu9ene commented 1 week ago

Reducing the deadline to 10 days helped.

eu9ene commented 1 week ago

Closing this since we've landed a workaround but I guess it doesn't solve the problem for some cases. Let's reopen if we see it again in the future.