mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Can't restart the pipeline to run distillation from where it left off #711

Open eu9ene opened 4 months ago

eu9ene commented 4 months ago

I'm trying to continue the pipeline after we get the results of the evaluate-teacher-ensemble task.

By doing something like:

target-stage: all
start-stage: translate-corpus
previous_group_ids: ["N1O85rIASLmCwKfUpCvTlw"]

or

start-stage: score

I'm getting an error:

[vcs 2024-06-28T00:15:22.294Z] TinderboxPrint:<a href='https://github.com/mozilla/firefox-translations-training/commit/e65c859948775bdf70b1eda8a7c233936d8e53c4' title='Built from firefox-translations-training commit e65c859948775bdf70b1eda8a7c233936d8e53c4'>e65c859948775bdf70b1eda8a7c233936d8e53c4</a>
[task 2024-06-28T00:15:22.294Z] executing ['bash', '-cx', 'cd /builds/worker/checkouts/src && ln -s /builds/worker/artifacts artifacts && taskgraph action-callback\n']
[task 2024-06-28T00:15:22.296Z] + cd /builds/worker/checkouts/src
[task 2024-06-28T00:15:22.296Z] + ln -s /builds/worker/artifacts artifacts
[task 2024-06-28T00:15:22.297Z] + taskgraph action-callback
[task 2024-06-28T00:15:23.187Z] Traceback (most recent call last):
[task 2024-06-28T00:15:23.188Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/main.py", line 708, in action_callback
[task 2024-06-28T00:15:23.189Z]     return trigger_action_callback(
[task 2024-06-28T00:15:23.190Z]            ^^^^^^^^^^^^^^^^^^^^^^^^
[task 2024-06-28T00:15:23.191Z]   File "/usr/local/lib/python3.11/dist-packages/taskgraph/actions/registry.py", line 345, in trigger_action_callback
[task 2024-06-28T00:15:23.191Z]     cb(Parameters(**parameters), graph_config, input, task_group_id, task_id)
[task 2024-06-28T00:15:23.191Z]   File "/builds/worker/checkouts/src/taskcluster/translations_taskgraph/actions/train.py", line 397, in train_action
[task 2024-06-28T00:15:23.191Z]     start_task_ids.append(label_to_task_id[label])
[task 2024-06-28T00:15:23.191Z]                           ~~~~~~~~~~~~~~~~^^^^^^^
[task 2024-06-28T00:15:23.191Z] KeyError: 'translate-corpus-da-en-1/20'

Completed group

log1 log2

Workaround:

start-stage: evaluate-teacher-ensemble

This reruns the evaluation, but at least the remaining tasks are scheduled properly.

bhearsum commented 3 months ago

The start-stage system requires that one of the previous groups contains the start-stage task. This is a documented caveat in https://github.com/mozilla/firefox-translations-training/blob/main/docs/task-cluster.md#running-only-later-parts-of-the-pipeline (originally from https://github.com/mozilla/firefox-translations-training/pull/377).

This is because in any case where multiple previous_group_ids are specified, we could end up with conflicts for the same label, and end up replacing with the wrong task. (And in general, the behaviour would become non-deterministic, which is not great...)
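
To make the two failure modes concrete, here is a minimal Python sketch. It is not the code in train.py: only the names label_to_task_id and start_task_ids come from the traceback above, the helper functions are hypothetical, and it assumes that a later group overwrites an earlier one when both contain the same label.

```python
# Hypothetical sketch, not the real train.py logic.
# Assumption: each previous group is a {label: task_id} map and later
# groups overwrite earlier ones when labels collide.

def merge_label_maps(groups):
    """Merge per-group label maps and record labels that conflict."""
    label_to_task_id = {}
    conflicts = {}
    for labels in groups.values():
        for label, task_id in labels.items():
            previous = label_to_task_id.get(label)
            if previous is not None and previous != task_id:
                # Same label, different task: which one is "right" is ambiguous.
                conflicts.setdefault(label, [previous]).append(task_id)
            label_to_task_id[label] = task_id
    return label_to_task_id, conflicts


def find_start_task_ids(label_to_task_id, start_labels):
    """Look up start-stage tasks, failing with a readable message."""
    missing = [label for label in start_labels if label not in label_to_task_id]
    if missing:
        # A bare dict lookup here is what surfaces as a KeyError.
        raise ValueError(f"start-stage tasks not found in previous groups: {missing}")
    return [label_to_task_id[label] for label in start_labels]
```

With placeholder IDs, a group that only ran up to evaluate-teacher-ensemble has no translate-corpus labels at all, so the lookup fails; and two groups that both contain train-teacher produce a conflict whose resolution depends on group order.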

We could conceivably fix this by allowing start_task_ids to be specified explicitly (which we'd use either in addition to or instead of the automatically detected ones). Or perhaps we should wait until we discuss #719 more before adding more hacks here?
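
As a rough illustration of the explicit start_task_ids idea, the sketch below shows both variants ("instead of" and "in addition to"). The function name and the replace flag are hypothetical and do not exist in the action today; this is only one way the option could behave.

```python
# Hypothetical sketch of the proposed option; nothing here is an existing API.

def resolve_start_task_ids(detected, explicit=None, replace=True):
    """Combine automatically detected start task IDs with explicit ones.

    replace=True  -> explicit IDs are used instead of the detected ones.
    replace=False -> explicit IDs are added to the detected ones.
    """
    explicit = list(explicit or [])
    if explicit and replace:
        return explicit
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(list(detected) + explicit))
```

Making the explicit list win by default keeps the result deterministic even when several previous groups contain the same label.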