Copying discussion from Slack related to the sudden rebuilding of the toolchains.
@bhearsum:
We have a guard in taskgraph that will not use a cached task if that task expires before the deadline of a task that depends on it. This is to ensure that any tasks upstream of the ones we're about to create will be available at all possible times the tasks could run. In this case, this task is the one we should've re-used, whose expiry is 2024-08-14T11:28:43.692Z. One of the tasks that wanted it ended up being created with a deadline of 2024-08-18T18:26:25.727Z, 4 days after the cached task expires.

For the short term, I suggest you reduce the default task deadline such that it will be before 2024-08-14. If you do that and try again with the same training config, I expect you'll pick up the tasks you expect.

For the medium term, we should lengthen the expiry time of toolchain tasks, and perhaps others. And it may be a good idea to force rebuilds at the start of big trainings to make sure we don't depend on any cached tasks that may be expiring soon.
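For illustration, here is a minimal sketch of the check described above; it is not the actual taskgraph code and the helper names are made up. The point is simply that a cached task is only reusable if its `expires` falls after the `deadline` of every task that would depend on it.

```python
# Minimal sketch (not the real taskgraph implementation) of the guard described
# above: a cached task may only be reused if it expires after the deadline of
# every task that will depend on it.
from datetime import datetime


def _parse(ts: str) -> datetime:
    # Taskcluster timestamps look like "2024-08-14T11:28:43.692Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def can_reuse_cached_task(cached_expires: str, dependent_deadline: str) -> bool:
    return _parse(cached_expires) > _parse(dependent_deadline)


# The situation above: the cached toolchain expires on 2024-08-14, but the
# dependent task's deadline is 2024-08-18, so the cached task is rejected
# (and, if it is referenced anyway, createTask fails with the InputError
# shown in the log below).
print(can_reuse_cached_task("2024-08-14T11:28:43.692Z",
                            "2024-08-18T18:26:25.727Z"))  # False
```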
Reducing the deadline to 10 days helped.
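For the record, the arithmetic behind that workaround, as a hedged sketch: the exact trigger time isn't stated in the thread, so the date below is an assumption.

```python
# Rough arithmetic behind the workaround. The trigger date is an assumption;
# the point is that a 10-day deadline stays below the toolchain expiry while
# the previous, longer default did not.
from datetime import datetime, timedelta, timezone

toolchain_expires = datetime(2024, 8, 14, 11, 28, 43, tzinfo=timezone.utc)
triggered = datetime(2024, 7, 31, tzinfo=timezone.utc)  # assumed trigger time

deadline = triggered + timedelta(days=10)   # 2024-08-10
print(deadline < toolchain_expires)         # True -> cached tasks can be reused
```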
Closing this since we've landed a workaround, though it doesn't solve the problem in all cases. Let's reopen if we see it again in the future.
It's back! I'm trying to run this config from this decision task and getting:
[task 2024-07-31T22:42:55.694Z] 2024-07-31 22:42:55,688 - INFO - Creating task with taskId EvcR5CreSt-bejbVc-tAmw for finetune-student-en-uk
[task 2024-07-31T22:42:55.695Z] 2024-07-31 22:42:55,694 - ERROR - `task.dependencies` references tasks that expires
[task 2024-07-31T22:42:55.695Z] before `task.deadline` this is not allowed, see tasks:
[task 2024-07-31T22:42:55.695Z] * QYRWW9BSTh-34mlpXqVVsg,
[task 2024-07-31T22:42:55.695Z] * VF--UyYRTI6hby6Gr143Vw,
[task 2024-07-31T22:42:55.695Z] All taskIds in `task.dependencies` **must** have
[task 2024-07-31T22:42:55.695Z] `task.expires` greater than the `deadline` for this task.
[task 2024-07-31T22:42:55.695Z]
[task 2024-07-31T22:42:55.695Z]
[task 2024-07-31T22:42:55.695Z] ---
[task 2024-07-31T22:42:55.695Z]
[task 2024-07-31T22:42:55.695Z] * method: createTask
[task 2024-07-31T22:42:55.695Z] * errorCode: InputError
[task 2024-07-31T22:42:55.695Z] * statusCode: 400
[task 2024-07-31T22:42:55.696Z] * time: 2024-07-31T22:42:55.690Z
[task 2024-07-31T22:42:56.170Z] Traceback (most recent call last):
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/taskgraph/main.py", line 708, in action_callback
[task 2024-07-31T22:42:56.171Z] return trigger_action_callback(
[task 2024-07-31T22:42:56.171Z] ^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/taskgraph/actions/registry.py", line 345, in trigger_action_callback
[task 2024-07-31T22:42:56.171Z] cb(Parameters(**parameters), graph_config, input, task_group_id, task_id)
[task 2024-07-31T22:42:56.171Z] File "/builds/worker/checkouts/src/taskcluster/translations_taskgraph/actions/train.py", line 442, in train_action
[task 2024-07-31T22:42:56.171Z] taskgraph_decision({"root": graph_config.root_dir}, parameters=parameters)
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/taskgraph/decision.py", line 127, in taskgraph_decision
[task 2024-07-31T22:42:56.171Z] create_tasks(
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/taskgraph/create.py", line 102, in create_tasks
[task 2024-07-31T22:42:56.171Z] f.result()
[task 2024-07-31T22:42:56.171Z] File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
[task 2024-07-31T22:42:56.171Z] return self.__get_result()
[task 2024-07-31T22:42:56.171Z] ^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z] File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
[task 2024-07-31T22:42:56.171Z] raise self._exception
[task 2024-07-31T22:42:56.171Z] File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
[task 2024-07-31T22:42:56.171Z] result = self.fn(*self.args, **self.kwargs)
[task 2024-07-31T22:42:56.171Z] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/taskgraph/create.py", line 132, in create_task
[task 2024-07-31T22:42:56.171Z] res.raise_for_status()
[task 2024-07-31T22:42:56.171Z] File "/usr/local/lib/python3.11/dist-packages/requests/models.py", line 1021, in raise_for_status
[task 2024-07-31T22:42:56.171Z] raise HTTPError(http_error_msg, response=self)
[task 2024-07-31T22:42:56.171Z] requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://taskcluster/queue/v1/task/exBw1YtiTEez__8lJIj58w
Config:
# The initial configuration was generated using:
# task config-generator -- en uk --name spring-2024
#
# The documentation for this config can be found here:
# https://github.com/mozilla/firefox-translations-training/blob/001d15d5f2775e0e4c57717057a1906069e29dcc/taskcluster/configs/config.prod.yml
experiment:
  name: spring-2024
  src: en
  trg: uk
  best-model: chrf
  use-opuscleaner: 'true'
  opuscleaner-mode: defaults
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  mono-max-sentences-src: 500_000_000
  mono-max-sentences-trg: 200_000_000
  spm-sample-size: 10_000_000
  spm-vocab-size: 32000
  teacher-ensemble: 2
  teacher-mode: two-stage
  pretrained-models: {}
datasets:
  # Skipped test/devtest datasets:
  devtest:
    - flores_aug-mix_dev
    - mtdata_aug-mix_Neulab-tedtalks_dev-1-eng-ukr
  test:
    - flores_devtest
    - flores_aug-mix_devtest
    - flores_aug-title_devtest
    - flores_aug-upper_devtest
    - flores_aug-typos_devtest
    - flores_aug-noise_devtest
    - flores_aug-inline-noise_devtest
    - mtdata_Neulab-tedtalks_test-1-eng-ukr
  # The training data contains:
  # 58,968,083 sentences
  #
  # Skipped datasets:
  # - opus_CCMatrix/v1 - ignored datasets (20,240,171 sentences)
  # - opus_MultiMaCoCu/v2 - ignored datasets (6,406,288 sentences)
  # - opus_GNOME/v1 - not enough data (150 sentences)
  # - opus_Ubuntu/v14.10 - not enough data (0 sentences)
  # - mtdata_ELRC-wikipedia_health-1-eng-ukr - duplicate with opus
  # - mtdata_Facebook-wikimatrix-1-eng-ukr - duplicate with opus
  # - mtdata_Neulab-tedtalks_train-1-eng-ukr - duplicate with opus
  # - mtdata_Statmt-ccaligned-1-eng-ukr_UA - duplicate with opus
  train:
    - opus_NLLB/v1 # 20,240,171 sentences
    - opus_ParaCrawl/v9 # 14,079,832 sentences
    - opus_CCAligned/v1 # 8,547,377 sentences
    - opus_MaCoCu/v2 # 6,406,294 sentences
    - opus_XLEnt/v1.2 # 3,671,061 sentences
    - opus_SUMMA/v1 # 1,574,611 sentences
    - opus_OpenSubtitles/v2018 # 877,780 sentences
    - opus_wikimedia/v20230407 # 757,910 sentences
    - opus_WikiMatrix/v1 # 681,115 sentences
    - opus_ELRC-5214-A_Lexicon_Named/v1 # 495,403 sentences
    - opus_ELRC-5183-SciPar_Ukraine/v1 # 306,813 sentences
    - opus_KDE4/v2 # 233,611 sentences
    - opus_QED/v2.0a # 215,630 sentences
    - opus_TED2020/v1 # 208,141 sentences
    - opus_Tatoeba/v2023-04-12 # 175,502 sentences
    - opus_ELRC-5179-acts_Ukrainian/v1 # 129,942 sentences
    - opus_ELRC-5180-Official_Parliament_/v1 # 116,260 sentences
    - opus_NeuLab-TedTalks/v1 # 115,474 sentences
    - opus_ELRC-5181-Official_Parliament_/v1 # 61,012 sentences
    - opus_ELRC-5174-French_Polish_Ukrain/v1 # 36,228 sentences
    - opus_bible-uedin/v1 # 15,901 sentences
    - opus_ELRC-5182-Official_Parliament_/v1 # 8,800 sentences
    - opus_ELRC-3043-wikipedia_health/v1 # 2,735 sentences
    - opus_ELRC-wikipedia_health/v1 # 2,735 sentences
    - opus_ELRC_2922/v1 # 2,734 sentences
    - opus_EUbookshop/v2 # 1,793 sentences
    - opus_TildeMODEL/v2018 # 1,628 sentences
    - opus_ELRC-5217-Ukrainian_Legal_MT/v1 # 997 sentences
    - opus_tldr-pages/v2023-08-29 # 593 sentences
    - mtdata_Tilde-worldbank-1-eng-ukr # ~2,011 sentences (227.3 kB)
  # The monolingual data contains:
  # ~209,074,237 sentences
  mono-src:
    - news-crawl_news.2007 # ~1,630,834 sentences (184.3 MB)
    - news-crawl_news.2008 # ~5,648,654 sentences (638.3 MB)
    - news-crawl_news.2009 # ~6,879,522 sentences (777.4 MB)
    - news-crawl_news.2010 # ~3,406,380 sentences (384.9 MB)
    - news-crawl_news.2011 # ~6,628,308 sentences (749.0 MB)
    - news-crawl_news.2012 # ~6,715,609 sentences (758.9 MB)
    - news-crawl_news.2013 # ~11,050,614 sentences (1.2 GB)
    - news-crawl_news.2014 # ~11,026,051 sentences (1.2 GB)
    - news-crawl_news.2015 # ~11,182,484 sentences (1.3 GB)
    - news-crawl_news.2016 # ~8,366,518 sentences (945.4 MB)
    - news-crawl_news.2017 # ~12,276,499 sentences (1.4 GB)
    - news-crawl_news.2018 # ~8,303,432 sentences (938.3 MB)
    - news-crawl_news.2019 # ~19,386,668 sentences (2.2 GB)
    - news-crawl_news.2020 # ~24,070,652 sentences (2.7 GB)
    - news-crawl_news.2021 # ~23,139,914 sentences (2.6 GB)
    - news-crawl_news.2022 # ~24,900,055 sentences (2.8 GB)
    - news-crawl_news.2023 # ~24,462,043 sentences (2.8 GB)
  # The monolingual data contains:
  # ~1,940,719 sentences
  mono-trg:
    - news-crawl_news.2008 # ~6,213 sentences (702.1 kB)
    - news-crawl_news.2009 # ~31,947 sentences (3.6 MB)
    - news-crawl_news.2010 # ~6,663 sentences (753.0 kB)
    - news-crawl_news.2011 # ~61,690 sentences (7.0 MB)
    - news-crawl_news.2012 # ~71,172 sentences (8.0 MB)
    - news-crawl_news.2013 # ~86,086 sentences (9.7 MB)
    - news-crawl_news.2014 # ~92,031 sentences (10.4 MB)
    - news-crawl_news.2015 # ~106,485 sentences (12.0 MB)
    - news-crawl_news.2016 # ~41,342 sentences (4.7 MB)
    - news-crawl_news.2018 # ~88,962 sentences (10.1 MB)
    - news-crawl_news.2019 # ~203,060 sentences (22.9 MB)
    - news-crawl_news.2020 # ~221,669 sentences (25.0 MB)
    - news-crawl_news.2021 # ~220,114 sentences (24.9 MB)
    - news-crawl_news.2022 # ~247,965 sentences (28.0 MB)
    - news-crawl_news.2023 # ~455,320 sentences (51.5 MB)
marian-args:
  decoding-backward:
    beam-size: '12'
    mini-batch-words: '2000'
  decoding-teacher:
    mini-batch-words: '4000'
    precision: float16
  training-backward:
    early-stopping: '5'
  training-teacher:
    early-stopping: '20'
  training-student:
    early-stopping: '20'
    mini-batch: '2000'
  training-student-finetuned:
    early-stopping: '20'
target-stage: all
start-stage: train-student
previous_group_ids: ["LYBo_BrUR8mkopI3Js2czQ"]
wandb-publication: true
taskcluster:
  split-chunks: 20
  worker-classes:
    default: gcp-spot
    alignments-original: gcp-standard
    alignments-backtranslated: gcp-standard
    alignments-student: gcp-standard
    shortlist: gcp-standard
    alignments-priors2: gcp-standard
It's not clear to me what the fix is here. I assume some old cached tasks are close to expiration. Even if we fix this model, other trainings are coming that will likely have deadlines past the expiry of those old cached tasks if they run in August.
Seems like we should rebuild the toolchains?
Doing that may force the entire pipeline to run again. So we need to:
I restarted 5 actions with the rebuilt toolchains and they appear to be working. We can close this.
The toolchains:
existing_tasks: {
"build-docker-image-base": "BAvLUilqQ3SYqy6Ck55CUQ",
"build-docker-image-test": "f0gbptvMTDaKODjqL9hlOw",
"build-docker-image-toolchain-build": "LlZa8-L9TRemgyzQcAxuHw",
"build-docker-image-train": "fBMJa9R5SKaXd2wgWeD5yQ",
"fetch-browsermt-marian": "BRviRlEMTie8AUFf5prHvg",
"fetch-cuda": "Kc8iWZguSyeGMZKY7OxnTQ",
"fetch-cuda-11": "RjR9dsYTQhe0HQJPHNN4Tg",
"fetch-cyhunspell": "XNYpMzBvSraicoNKyUIwxA",
"fetch-extract-lex": "J2FS7TLLT4m2mjD0IGw91A",
"fetch-fast-align": "Tim8u7s-TAeTYG5VnzmXfA",
"fetch-hunspell": "Wn1pnCSQSpqKeRpCV52FqQ",
"fetch-kenlm": "J4U7RFz2TASaNNTTqoQ8sg",
"fetch-marian": "Sw_bpajdSgWxEDG3uW0-nQ",
"fetch-preprocess": "Scn2N5dLRXKCEU4T1JYE3A",
"toolchain-browsermt-marian": "aP5l3b05S9q3G25Nm85d6w",
"toolchain-cuda-toolkit": "UuUG70nvSj2pHcKt8JFbKw",
"toolchain-cuda-toolkit-11": "YhKI4TKlTFep-FpU7D2L7A",
"toolchain-cyhunspell": "DTvS_tZeSluSlAHkViW3lg",
"toolchain-extract-lex": "Xb7KAXA7TziSrxVQWS0Wmw",
"toolchain-fast-align": "Ia-7gLTQSJeCj_RLs7sg4w",
"toolchain-hunspell": "V84fX3jvQ-Knr4hZT9B8DQ",
"toolchain-kenlm": "X6SgAIzhQlyL7g_nIfE-YQ",
"toolchain-marian": "AoV-W4IzRo22lQBtJWsTxQ",
"toolchain-marian-cpu": "Za5VkFoyS6mauNnmEYxV7g",
"toolchain-preprocess": "ZozJMTdgQD-Bm9sSaG7soA"
}
I will note that the acute issue is fixed, but we could hit similar problems in the future. One thing we discussed is that we should probably force-rebuild docker/toolchain tasks ahead of future big trainings so that those trainings have ~1 year before anything they depend on expires.
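A hedged sketch of what such a pre-flight check could look like, using the public Taskcluster queue API. The `EXISTING_TASKS` excerpt and the 90-day horizon are assumptions for illustration, not part of the pipeline.

```python
# Sketch: before kicking off a long training, check how much time is left on
# the cached tasks we plan to reuse. The endpoint is the public Taskcluster
# queue API; EXISTING_TASKS would be the mapping pasted above.
from datetime import datetime, timedelta, timezone

import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
EXISTING_TASKS = {
    "toolchain-marian": "AoV-W4IzRo22lQBtJWsTxQ",
    # ... the rest of the existing_tasks mapping above
}
TRAINING_HORIZON = timedelta(days=90)  # assumed length of the upcoming training


def expiring_too_soon(tasks: dict[str, str]) -> list[str]:
    cutoff = datetime.now(timezone.utc) + TRAINING_HORIZON
    soon = []
    for label, task_id in tasks.items():
        # The task definition includes its `expires` timestamp.
        task = requests.get(f"{QUEUE}/task/{task_id}", timeout=30).json()
        expires = datetime.fromisoformat(task["expires"].replace("Z", "+00:00"))
        if expires < cutoff:
            soon.append(f"{label} ({task_id}) expires {task['expires']}")
    return soon


for line in expiring_too_soon(EXISTING_TASKS):
    print("rebuild before training:", line)
```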
It seems the deadline issue is even more serious: my train action with caching now fails:
https://firefox-ci-tc.services.mozilla.com/tasks/DQkkrHmNQP6dSjg5p_YYfQ/runs/0
I'm running it from this PR push task: https://github.com/mozilla/firefox-translations-training/pull/690, with this config, which is the regular way to use the cached tasks: