mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

ERROR - `task.dependencies` references tasks that expires before `task.deadline` #691

Closed: eu9ene closed this issue 3 months ago

eu9ene commented 5 months ago

It seems the issue with the deadline is even more serious. My train action with caching fails now:

https://firefox-ci-tc.services.mozilla.com/tasks/DQkkrHmNQP6dSjg5p_YYfQ/runs/0

I'm running it from the push task of this PR: https://github.com/mozilla/firefox-translations-training/pull/690

The config below uses cached tasks in the regular way:

# The initial configuration was generated using:
# task config-generator -- en uk --name spring-2024
#
# The documentation for this config can be found here:
# https://github.com/mozilla/firefox-translations-training/blob/001d15d5f2775e0e4c57717057a1906069e29dcc/taskcluster/configs/config.prod.yml
experiment:
  name: spring-2024
  src: en
  trg: uk
  best-model: chrf
  use-opuscleaner: 'true'
  opuscleaner-mode: defaults
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  mono-max-sentences-src: 500_000_000
  mono-max-sentences-trg: 200_000_000
  spm-sample-size: 10_000_000
  spm-vocab-size: 32000
  teacher-ensemble: 2
  teacher-mode: two-stage
  pretrained-models: {}
datasets:

  # Skipped test/devtest datasets:
  devtest:
  - flores_aug-mix_dev
  - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-ukr
  test:
  - flores_devtest
  - flores_aug-mix_devtest
  - flores_aug-title_devtest
  - flores_aug-upper_devtest
  - flores_aug-typos_devtest
  - flores_aug-noise_devtest
  - flores_aug-inline-noise_devtest
  - mtdata_Neulab-tedtalks_test-1-eng-ukr

  # The training data contains:
  #   58,968,083 sentences
  #
  # Skipped datasets:
  #  - opus_CCMatrix/v1 - ignored datasets (20,240,171 sentences)
  #  - opus_MultiMaCoCu/v2 - ignored datasets (6,406,288 sentences)
  #  - opus_GNOME/v1 - not enough data  (150 sentences)
  #  - opus_Ubuntu/v14.10 - not enough data  (0 sentences)
  #  - mtdata_ELRC-wikipedia_health-1-eng-ukr - duplicate with opus
  #  - mtdata_Facebook-wikimatrix-1-eng-ukr - duplicate with opus
  #  - mtdata_Neulab-tedtalks_train-1-eng-ukr - duplicate with opus
  #  - mtdata_Statmt-ccaligned-1-eng-ukr_UA - duplicate with opus
  train:
  - opus_NLLB/v1  #                                       20,240,171 sentences
  - opus_ParaCrawl/v9 #                                  14,079,832 sentences
  - opus_CCAligned/v1 #                                   8,547,377 sentences
  - opus_MaCoCu/v2 #                                      6,406,294 sentences
  - opus_XLEnt/v1.2 #                                     3,671,061 sentences
  - opus_SUMMA/v1 #                                       1,574,611 sentences
  - opus_OpenSubtitles/v2018 #                              877,780 sentences
  - opus_wikimedia/v20230407 #                              757,910 sentences
  - opus_WikiMatrix/v1 #                                    681,115 sentences
  - opus_ELRC-5214-A_Lexicon_Named/v1 #                     495,403 sentences
  - opus_ELRC-5183-SciPar_Ukraine/v1 #                      306,813 sentences
  - opus_KDE4/v2 #                                          233,611 sentences
  - opus_QED/v2.0a #                                        215,630 sentences
  - opus_TED2020/v1 #                                       208,141 sentences
  - opus_Tatoeba/v2023-04-12 #                              175,502 sentences
  - opus_ELRC-5179-acts_Ukrainian/v1 #                      129,942 sentences
  - opus_ELRC-5180-Official_Parliament_/v1 #                116,260 sentences
  - opus_NeuLab-TedTalks/v1 #                               115,474 sentences
  - opus_ELRC-5181-Official_Parliament_/v1 #                 61,012 sentences
  - opus_ELRC-5174-French_Polish_Ukrain/v1 #                 36,228 sentences
  - opus_bible-uedin/v1 #                                    15,901 sentences
  - opus_ELRC-5182-Official_Parliament_/v1 #                  8,800 sentences
  - opus_ELRC-3043-wikipedia_health/v1 #                      2,735 sentences
  - opus_ELRC-wikipedia_health/v1 #                           2,735 sentences
  - opus_ELRC_2922/v1 #                                       2,734 sentences
  - opus_EUbookshop/v2 #                                      1,793 sentences
  - opus_TildeMODEL/v2018 #                                   1,628 sentences
  - opus_ELRC-5217-Ukrainian_Legal_MT/v1 #                      997 sentences
  - opus_tldr-pages/v2023-08-29 #                               593 sentences
  - mtdata_Tilde-worldbank-1-eng-ukr #                      ~2,011 sentences (227.3 kB)

  # The monolingual data contains:
  #   ~209,074,237 sentences
  mono-src:
  - news-crawl_news.2007  #          ~1,630,834 sentences (184.3 MB)
  - news-crawl_news.2008 #          ~5,648,654 sentences (638.3 MB)
  - news-crawl_news.2009 #          ~6,879,522 sentences (777.4 MB)
  - news-crawl_news.2010 #          ~3,406,380 sentences (384.9 MB)
  - news-crawl_news.2011 #          ~6,628,308 sentences (749.0 MB)
  - news-crawl_news.2012 #          ~6,715,609 sentences (758.9 MB)
  - news-crawl_news.2013 #         ~11,050,614 sentences (1.2 GB)
  - news-crawl_news.2014 #         ~11,026,051 sentences (1.2 GB)
  - news-crawl_news.2015 #         ~11,182,484 sentences (1.3 GB)
  - news-crawl_news.2016 #          ~8,366,518 sentences (945.4 MB)
  - news-crawl_news.2017 #         ~12,276,499 sentences (1.4 GB)
  - news-crawl_news.2018 #          ~8,303,432 sentences (938.3 MB)
  - news-crawl_news.2019 #         ~19,386,668 sentences (2.2 GB)
  - news-crawl_news.2020 #         ~24,070,652 sentences (2.7 GB)
  - news-crawl_news.2021 #         ~23,139,914 sentences (2.6 GB)
  - news-crawl_news.2022 #         ~24,900,055 sentences (2.8 GB)
  - news-crawl_news.2023 #         ~24,462,043 sentences (2.8 GB)

  # The monolingual data contains:
  #   ~1,940,719 sentences
  mono-trg:
  - news-crawl_news.2008  #              ~6,213 sentences (702.1 kB)
  - news-crawl_news.2009 #             ~31,947 sentences (3.6 MB)
  - news-crawl_news.2010 #              ~6,663 sentences (753.0 kB)
  - news-crawl_news.2011 #             ~61,690 sentences (7.0 MB)
  - news-crawl_news.2012 #             ~71,172 sentences (8.0 MB)
  - news-crawl_news.2013 #             ~86,086 sentences (9.7 MB)
  - news-crawl_news.2014 #             ~92,031 sentences (10.4 MB)
  - news-crawl_news.2015 #            ~106,485 sentences (12.0 MB)
  - news-crawl_news.2016 #             ~41,342 sentences (4.7 MB)
  - news-crawl_news.2018 #             ~88,962 sentences (10.1 MB)
  - news-crawl_news.2019 #            ~203,060 sentences (22.9 MB)
  - news-crawl_news.2020 #            ~221,669 sentences (25.0 MB)
  - news-crawl_news.2021 #            ~220,114 sentences (24.9 MB)
  - news-crawl_news.2022 #            ~247,965 sentences (28.0 MB)
  - news-crawl_news.2023 #            ~455,320 sentences (51.5 MB)
marian-args:
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  decoding-teacher:
    mini-batch-words: '4000'
    precision: float16
  training-backward:
    early-stopping: '5'
  training-teacher:
    early-stopping: '20'
  training-student:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
target-stage: all
start-stage: alignments-student
previous_group_ids: ["a5F1swqWQhKp88A4aiJ3wg"]
wandb-publication: true
taskcluster:
  split-chunks: 20
  worker-classes:
    default: gcp-spot
eu9ene commented 5 months ago

Copying discussion from Slack related to the sudden rebuilding of the toolchains.

@bhearsum:

We have a guard in taskgraph that will not use a cached task if that task expires before the deadline of a task that depends on it. This is to ensure that any tasks upstream of the ones we're about to create will be available at all possible times the tasks could run.

In this case, this task is the one we should've re-used, whose expiry is 2024-08-14T11:28:43.692Z. One of the tasks that wanted it ended up being created with a deadline of 2024-08-18T18:26:25.727Z - 4 days after the cached task expires.

For the short term, I suggest you reduce the default task deadline such that it will be before 2024-08-14. If you do that and try again with the same training config, I expect you'll pick up the tasks you expect.

For the medium term, we should lengthen the expiry time of toolchain, and perhaps other tasks. And it may be a good idea to force rebuilds at the start of big trainings to make sure we don't depend on any cached tasks that may be expiring soon.
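To make the rule above concrete, here is a minimal sketch of the comparison, using the two timestamps quoted in the comment. The helper names `parse_ts` and `can_reuse_cached_task` are hypothetical and are not taskgraph's actual code; the sketch only assumes that the guard boils down to "expiry must be later than deadline".

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-08-14T11:28:43.692Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def can_reuse_cached_task(cached_expires: str, dependent_deadline: str) -> bool:
    """Hypothetical helper: reuse a cached task only if it outlives the dependent's deadline."""
    return parse_ts(cached_expires) > parse_ts(dependent_deadline)


# Values quoted in the comment above.
cached_expires = "2024-08-14T11:28:43.692Z"      # expiry of the cached task
dependent_deadline = "2024-08-18T18:26:25.727Z"  # deadline of the task that wanted it

print(can_reuse_cached_task(cached_expires, dependent_deadline))  # False

# How far the deadline overshoots the expiry, in whole days.
gap = parse_ts(dependent_deadline) - parse_ts(cached_expires)
print(gap.days)  # 4
```

Under this reading, any deadline earlier than the cached task's expiry would let the cached task be picked up again, which is consistent with the short-term suggestion to shorten the default deadline.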

eu9ene commented 5 months ago

Reducing the deadline to 10 days helped.

eu9ene commented 5 months ago

Closing this since we've landed a workaround, though I guess it doesn't solve the problem in some cases. Let's reopen if we see it again in the future.

eu9ene commented 3 months ago

It's back! I'm trying to run this config from this decision task and getting:

[task 2024-07-31T22:42:55.694Z] 2024-07-31 22:42:55,688 - INFO - Creating task with taskId EvcR5CreSt-bejbVc-tAmw for finetune-student-en-uk
[task 2024-07-31T22:42:55.695Z] 2024-07-31 22:42:55,694 - ERROR - `task.dependencies` references tasks that expires
[task 2024-07-31T22:42:55.695Z] before `task.deadline` this is not allowed, see tasks: 
[task 2024-07-31T22:42:55.695Z]  * QYRWW9BSTh-34mlpXqVVsg,
[task 2024-07-31T22:42:55.695Z]  * VF--UyYRTI6hby6Gr143Vw,
[task 2024-07-31T22:42:55.695Z] All taskIds in `task.dependencies` **must** have
[task 2024-07-31T22:42:55.695Z] `task.expires` greater than the `deadline` for this task.
[task 2024-07-31T22:42:55.695Z] 
[task 2024-07-31T22:42:55.695Z] 
[task 2024-07-31T22:42:55.695Z] ---
[task 2024-07-31T22:42:55.695Z] 
[task 2024-07-31T22:42:55.695Z] * method:     createTask
[task 2024-07-31T22:42:55.695Z] * errorCode:  InputError
[task 2024-07-31T22:42:55.695Z] * statusCode: 400
[task 2024-07-31T22:42:55.696Z] * time:       2024-07-31T22:42:55.690Z
[task 2024-07-31T22:42:56.170Z] Traceback (most recent call last):
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/main.py", line 708, in action_callback
[task 2024-07-31T22:42:56.171Z]     return trigger_action_callback(
[task 2024-07-31T22:42:56.171Z]            ^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/actions/registry.py", line 345, in trigger_action_callback
[task 2024-07-31T22:42:56.171Z]     cb(Parameters(**parameters), graph_config, input, task_group_id, task_id)
[task 2024-07-31T22:42:56.171Z]   File "/builds/worker/checkouts/src/taskcluster/translations_taskgraph/actions/train.py", line 442, in train_action
[task 2024-07-31T22:42:56.171Z]     taskgraph_decision({"root": graph_config.root_dir}, parameters=parameters)
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/decision.py", line 127, in taskgraph_decision
[task 2024-07-31T22:42:56.171Z]     create_tasks(
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/create.py", line 102, in create_tasks
[task 2024-07-31T22:42:56.171Z]     f.result()
[task 2024-07-31T22:42:56.171Z]   File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
[task 2024-07-31T22:42:56.171Z]     return self.__get_result()
[task 2024-07-31T22:42:56.171Z]            ^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z]   File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
[task 2024-07-31T22:42:56.171Z]     raise self._exception
[task 2024-07-31T22:42:56.171Z]   File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
[task 2024-07-31T22:42:56.171Z]     result = self.fn(*self.args, **self.kwargs)
[task 2024-07-31T22:42:56.171Z]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/create.py", line 132, in create_task
[task 2024-07-31T22:42:56.171Z]     res.raise_for_status()
[task 2024-07-31T22:42:56.171Z]   File "/usr/local/lib/python3.11/dist-packages/requests/models.py", line 1021, in raise_for_status
[task 2024-07-31T22:42:56.171Z]     raise HTTPError(http_error_msg, response=self)
[task 2024-07-31T22:42:56.171Z] requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://taskcluster/queue/v1/task/exBw1YtiTEez__8lJIj58w

Taskcluster log

Config:

# The initial configuration was generated using:
# task config-generator -- en uk --name spring-2024
#
# The documentation for this config can be found here:
# https://github.com/mozilla/firefox-translations-training/blob/001d15d5f2775e0e4c57717057a1906069e29dcc/taskcluster/configs/config.prod.yml
experiment:
  name: spring-2024
  src: en
  trg: uk
  best-model: chrf
  use-opuscleaner: 'true'
  opuscleaner-mode: defaults
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  mono-max-sentences-src: 500_000_000
  mono-max-sentences-trg: 200_000_000
  spm-sample-size: 10_000_000
  spm-vocab-size: 32000
  teacher-ensemble: 2
  teacher-mode: two-stage
  pretrained-models: {}
datasets:

  # Skipped test/devtest datasets:
  devtest:
  - flores_aug-mix_dev
  - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-ukr
  test:
  - flores_devtest
  - flores_aug-mix_devtest
  - flores_aug-title_devtest
  - flores_aug-upper_devtest
  - flores_aug-typos_devtest
  - flores_aug-noise_devtest
  - flores_aug-inline-noise_devtest
  - mtdata_Neulab-tedtalks_test-1-eng-ukr

  # The training data contains:
  #   58,968,083 sentences
  #
  # Skipped datasets:
  #  - opus_CCMatrix/v1 - ignored datasets (20,240,171 sentences)
  #  - opus_MultiMaCoCu/v2 - ignored datasets (6,406,288 sentences)
  #  - opus_GNOME/v1 - not enough data  (150 sentences)
  #  - opus_Ubuntu/v14.10 - not enough data  (0 sentences)
  #  - mtdata_ELRC-wikipedia_health-1-eng-ukr - duplicate with opus
  #  - mtdata_Facebook-wikimatrix-1-eng-ukr - duplicate with opus
  #  - mtdata_Neulab-tedtalks_train-1-eng-ukr - duplicate with opus
  #  - mtdata_Statmt-ccaligned-1-eng-ukr_UA - duplicate with opus
  train:
  - opus_NLLB/v1  #                                       20,240,171 sentences
  - opus_ParaCrawl/v9 #                                  14,079,832 sentences
  - opus_CCAligned/v1 #                                   8,547,377 sentences
  - opus_MaCoCu/v2 #                                      6,406,294 sentences
  - opus_XLEnt/v1.2 #                                     3,671,061 sentences
  - opus_SUMMA/v1 #                                       1,574,611 sentences
  - opus_OpenSubtitles/v2018 #                              877,780 sentences
  - opus_wikimedia/v20230407 #                              757,910 sentences
  - opus_WikiMatrix/v1 #                                    681,115 sentences
  - opus_ELRC-5214-A_Lexicon_Named/v1 #                     495,403 sentences
  - opus_ELRC-5183-SciPar_Ukraine/v1 #                      306,813 sentences
  - opus_KDE4/v2 #                                          233,611 sentences
  - opus_QED/v2.0a #                                        215,630 sentences
  - opus_TED2020/v1 #                                       208,141 sentences
  - opus_Tatoeba/v2023-04-12 #                              175,502 sentences
  - opus_ELRC-5179-acts_Ukrainian/v1 #                      129,942 sentences
  - opus_ELRC-5180-Official_Parliament_/v1 #                116,260 sentences
  - opus_NeuLab-TedTalks/v1 #                               115,474 sentences
  - opus_ELRC-5181-Official_Parliament_/v1 #                 61,012 sentences
  - opus_ELRC-5174-French_Polish_Ukrain/v1 #                 36,228 sentences
  - opus_bible-uedin/v1 #                                    15,901 sentences
  - opus_ELRC-5182-Official_Parliament_/v1 #                  8,800 sentences
  - opus_ELRC-3043-wikipedia_health/v1 #                      2,735 sentences
  - opus_ELRC-wikipedia_health/v1 #                           2,735 sentences
  - opus_ELRC_2922/v1 #                                       2,734 sentences
  - opus_EUbookshop/v2 #                                      1,793 sentences
  - opus_TildeMODEL/v2018 #                                   1,628 sentences
  - opus_ELRC-5217-Ukrainian_Legal_MT/v1 #                      997 sentences
  - opus_tldr-pages/v2023-08-29 #                               593 sentences
  - mtdata_Tilde-worldbank-1-eng-ukr #                      ~2,011 sentences (227.3 kB)

  # The monolingual data contains:
  #   ~209,074,237 sentences
  mono-src:
  - news-crawl_news.2007  #          ~1,630,834 sentences (184.3 MB)
  - news-crawl_news.2008 #          ~5,648,654 sentences (638.3 MB)
  - news-crawl_news.2009 #          ~6,879,522 sentences (777.4 MB)
  - news-crawl_news.2010 #          ~3,406,380 sentences (384.9 MB)
  - news-crawl_news.2011 #          ~6,628,308 sentences (749.0 MB)
  - news-crawl_news.2012 #          ~6,715,609 sentences (758.9 MB)
  - news-crawl_news.2013 #         ~11,050,614 sentences (1.2 GB)
  - news-crawl_news.2014 #         ~11,026,051 sentences (1.2 GB)
  - news-crawl_news.2015 #         ~11,182,484 sentences (1.3 GB)
  - news-crawl_news.2016 #          ~8,366,518 sentences (945.4 MB)
  - news-crawl_news.2017 #         ~12,276,499 sentences (1.4 GB)
  - news-crawl_news.2018 #          ~8,303,432 sentences (938.3 MB)
  - news-crawl_news.2019 #         ~19,386,668 sentences (2.2 GB)
  - news-crawl_news.2020 #         ~24,070,652 sentences (2.7 GB)
  - news-crawl_news.2021 #         ~23,139,914 sentences (2.6 GB)
  - news-crawl_news.2022 #         ~24,900,055 sentences (2.8 GB)
  - news-crawl_news.2023 #         ~24,462,043 sentences (2.8 GB)

  # The monolingual data contains:
  #   ~1,940,719 sentences
  mono-trg:
  - news-crawl_news.2008  #              ~6,213 sentences (702.1 kB)
  - news-crawl_news.2009 #             ~31,947 sentences (3.6 MB)
  - news-crawl_news.2010 #              ~6,663 sentences (753.0 kB)
  - news-crawl_news.2011 #             ~61,690 sentences (7.0 MB)
  - news-crawl_news.2012 #             ~71,172 sentences (8.0 MB)
  - news-crawl_news.2013 #             ~86,086 sentences (9.7 MB)
  - news-crawl_news.2014 #             ~92,031 sentences (10.4 MB)
  - news-crawl_news.2015 #            ~106,485 sentences (12.0 MB)
  - news-crawl_news.2016 #             ~41,342 sentences (4.7 MB)
  - news-crawl_news.2018 #             ~88,962 sentences (10.1 MB)
  - news-crawl_news.2019 #            ~203,060 sentences (22.9 MB)
  - news-crawl_news.2020 #            ~221,669 sentences (25.0 MB)
  - news-crawl_news.2021 #            ~220,114 sentences (24.9 MB)
  - news-crawl_news.2022 #            ~247,965 sentences (28.0 MB)
  - news-crawl_news.2023 #            ~455,320 sentences (51.5 MB)
marian-args:
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  decoding-teacher:
    mini-batch-words: '4000'
    precision: float16
  training-backward:
    early-stopping: '5'
  training-teacher:
    early-stopping: '20'
  training-student:
    early-stopping: '20'
    mini-batch: '2000'
  training-student-finetuned:
    early-stopping: '20'
target-stage: all
start-stage: train-student
previous_group_ids: ["LYBo_BrUR8mkopI3Js2czQ"]
wandb-publication: true
taskcluster:
  split-chunks: 20
  worker-classes:
    default: gcp-spot
    alignments-original: gcp-standard
    alignments-backtranslated: gcp-standard
    alignments-student: gcp-standard
    shortlist: gcp-standard
    alignments-priors2: gcp-standard

It's not clear to me what the fix is here. I assume some old cached tasks are close to expiration. Even if we fix this model, there are others coming whose deadlines will likely exceed the expiry of those old tasks if it falls in August.
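When triaging an error like the one in the log above, it can help to pull out the offending task IDs so their expiry can be looked up. The following is a rough sketch, not part of the pipeline: `expired_dependency_ids` is a hypothetical helper, and it assumes the IDs appear as `*` bullets in the 22-character URL-safe slug format that Taskcluster uses for task IDs.

```python
import re

# Assumption: each offending dependency is listed as "* <22-char task ID>," on its own line.
TASK_ID_BULLET = re.compile(r"\*\s+([A-Za-z0-9_-]{22}),?\s*$")


def expired_dependency_ids(log_text: str) -> list[str]:
    """Return the task IDs listed as bullets in the createTask error message."""
    ids = []
    for line in log_text.splitlines():
        match = TASK_ID_BULLET.search(line)
        if match:
            ids.append(match.group(1))
    return ids


# Excerpt copied from the log above.
log_excerpt = """\
[task 2024-07-31T22:42:55.695Z] before `task.deadline` this is not allowed, see tasks:
[task 2024-07-31T22:42:55.695Z]  * QYRWW9BSTh-34mlpXqVVsg,
[task 2024-07-31T22:42:55.695Z]  * VF--UyYRTI6hby6Gr143Vw,
[task 2024-07-31T22:42:55.695Z] * method:     createTask
"""

print(expired_dependency_ids(log_excerpt))
# ['QYRWW9BSTh-34mlpXqVVsg', 'VF--UyYRTI6hby6Gr143Vw']
```

The `* method:` style lines in the same message are skipped because what follows the bullet is not a 22-character slug.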

gabrielBusta commented 3 months ago

Seems like we should rebuild the toolchains?

Doing that may force the entire pipeline to run again. So we need to:

eu9ene commented 3 months ago

I restarted 5 actions with the rebuilt toolchains and they appear to be working. We can close this.

The toolchains:

existing_tasks: {
        "build-docker-image-base": "BAvLUilqQ3SYqy6Ck55CUQ",
        "build-docker-image-test": "f0gbptvMTDaKODjqL9hlOw",
        "build-docker-image-toolchain-build": "LlZa8-L9TRemgyzQcAxuHw",
        "build-docker-image-train": "fBMJa9R5SKaXd2wgWeD5yQ",
        "fetch-browsermt-marian": "BRviRlEMTie8AUFf5prHvg",
        "fetch-cuda": "Kc8iWZguSyeGMZKY7OxnTQ",
        "fetch-cuda-11": "RjR9dsYTQhe0HQJPHNN4Tg",
        "fetch-cyhunspell": "XNYpMzBvSraicoNKyUIwxA",
        "fetch-extract-lex": "J2FS7TLLT4m2mjD0IGw91A",
        "fetch-fast-align": "Tim8u7s-TAeTYG5VnzmXfA",
        "fetch-hunspell": "Wn1pnCSQSpqKeRpCV52FqQ",
        "fetch-kenlm": "J4U7RFz2TASaNNTTqoQ8sg",
        "fetch-marian": "Sw_bpajdSgWxEDG3uW0-nQ",
        "fetch-preprocess": "Scn2N5dLRXKCEU4T1JYE3A",
        "toolchain-browsermt-marian": "aP5l3b05S9q3G25Nm85d6w",
        "toolchain-cuda-toolkit": "UuUG70nvSj2pHcKt8JFbKw",
        "toolchain-cuda-toolkit-11": "YhKI4TKlTFep-FpU7D2L7A",
        "toolchain-cyhunspell": "DTvS_tZeSluSlAHkViW3lg",
        "toolchain-extract-lex": "Xb7KAXA7TziSrxVQWS0Wmw",
        "toolchain-fast-align": "Ia-7gLTQSJeCj_RLs7sg4w",
        "toolchain-hunspell": "V84fX3jvQ-Knr4hZT9B8DQ",
        "toolchain-kenlm": "X6SgAIzhQlyL7g_nIfE-YQ",
        "toolchain-marian": "AoV-W4IzRo22lQBtJWsTxQ",
        "toolchain-marian-cpu": "Za5VkFoyS6mauNnmEYxV7g",
        "toolchain-preprocess": "ZozJMTdgQD-Bm9sSaG7soA"
    }
bhearsum commented 3 months ago

I will note that the acute issue is fixed, but we have the potential to hit similar things in the future. One thing we discussed is that we should probably force rebuild docker/toolchain tasks ahead of future big trainings so that those trainings will have ~1y before anything they depend on expires.
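A pre-flight check along those lines could look roughly like the sketch below. It is an illustration only: `tasks_to_rebuild` is a hypothetical helper, the task labels are taken from the `existing_tasks` list above, and the expiry dates are made up for the example (real values would have to be read from Taskcluster).

```python
from datetime import datetime, timedelta, timezone


def tasks_to_rebuild(expiries, now, max_training_days):
    """Return labels of cached tasks that expire on or before the latest possible deadline."""
    latest_deadline = now + timedelta(days=max_training_days)
    return sorted(label for label, expires in expiries.items() if expires <= latest_deadline)


# Hypothetical expiry dates, for illustration only.
expiries = {
    "toolchain-marian": datetime(2024, 8, 14, tzinfo=timezone.utc),
    "build-docker-image-train": datetime(2025, 7, 1, tzinfo=timezone.utc),
}
now = datetime(2024, 7, 31, tzinfo=timezone.utc)

# With a 20-day deadline, the task expiring on 2024-08-14 would need a rebuild first.
print(tasks_to_rebuild(expiries, now, max_training_days=20))  # ['toolchain-marian']

# With a 10-day deadline, nothing expires in time to cause the error.
print(tasks_to_rebuild(expiries, now, max_training_days=10))  # []
```

Running such a check before a big training would flag the cached tasks worth rebuilding, rather than discovering them when `createTask` is rejected.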