mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

TaskCluster not via-CI #403

Open AmitMY opened 9 months ago

AmitMY commented 9 months ago

Since you are not maintaining the Snakemake pipeline, I'd like to use TaskCluster. I read these instructions (https://github.com/mozilla/firefox-translations-training/blob/main/docs/task-cluster.md), which seem to say that training runs are triggered from CI.

I would like to run taskcluster locally, and configure it to my GCP instance.

Seems like I need to start with

git clone https://github.com/taskcluster/taskcluster
cd taskcluster
docker compose up
echo '127.0.0.1 taskcluster' | sudo tee -a /etc/hosts

Now opening http://taskcluster opens taskcluster.

From here, how can I push the task group in this repository to my Taskcluster instance? I feel like the tutorial should cover that. Also, will the tasks spawn GCP workers as needed, or should those be created ahead of time?

gregtatum commented 9 months ago

I'm not a taskcluster expert, and maybe others can chime in here.

This has information on the taskgraph that is generated: https://taskcluster-taskgraph.readthedocs.io/en/latest/

If you run utils/preflight_check.py, it will generate a local taskgraph that you can inspect. The output is written to the artifacts/ directory in the repo. I know there is an artifacts/run-task in there. The artifacts/full-task-graph.json file contains all of the tasks that need to run.
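For a quick look at the generated file, a minimal sketch like the following could count tasks per kind. It assumes that artifacts/full-task-graph.json is a JSON object mapping task labels to entries that may carry a "kind" field. That layout is an assumption here, and the function name summarize_task_graph is only illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_task_graph(path):
    """Count tasks per kind in a task graph JSON file."""
    graph = json.loads(Path(path).read_text())
    # Assumption: the file is an object mapping task labels to task entries,
    # each of which may carry a "kind" field.
    return Counter(entry.get("kind", "unknown") for entry in graph.values())


if __name__ == "__main__":
    graph_path = Path("artifacts/full-task-graph.json")
    if graph_path.exists():
        for kind, count in summarize_task_graph(graph_path).most_common():
            print(f"{kind}: {count}")
    else:
        print(f"{graph_path} not found; run the preflight check first")
```

Run from the repository root after the preflight check, this prints one line per task kind, which gives a rough picture of how large the training graph is.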

How Taskcluster works beyond that is outside my understanding of the system.

There is the https://chat.mozilla.org/#/room/#taskcluster:mozilla.org group that may answer questions.

AmitMY commented 9 months ago

I generated the task graph using:

make preflight-check

The run-task seems to need to run on the servers, not on my client. I still can't figure out how to do this outside of CI, though.

My goal is:

  1. get a small VM running taskcluster
  2. "push" a task graph to it
  3. from taskcluster, start a new training job, which will spawn GCP instances to run tasks
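
For step 2, any script that submits the graph would probably want to create tasks after the tasks they depend on. The sketch below only computes such an order and does not talk to Taskcluster at all. It assumes each entry has a "dependencies" object mapping an edge name to the label of another task. The example labels are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def submission_order(graph):
    """Return task labels ordered so each task follows its dependencies."""
    sorter = TopologicalSorter()
    for label, entry in graph.items():
        # Assumption: "dependencies" maps an edge name to the label of the
        # task being depended on.
        deps = entry.get("dependencies", {}).values()
        # Ignore dependencies that are not part of this graph.
        sorter.add(label, *(d for d in deps if d in graph))
    return list(sorter.static_order())


# Hypothetical three-task graph, used only to illustrate the ordering.
example = {
    "train-model": {"dependencies": {"corpus": "clean-corpus"}},
    "clean-corpus": {"dependencies": {"dataset": "download-dataset"}},
    "download-dataset": {"dependencies": {}},
}
print(submission_order(example))
```

Turning each entry into a real task on a queue (resolving labels to task IDs, credentials, worker pools) is a separate problem that this sketch leaves out.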

bhearsum commented 9 months ago

Apologies for the slow reply - I didn't see this issue until now.

It is technically possible to run your own Taskcluster instance and run training on it, although I'm not sure I would advise it. Roughly, the steps would be:

The Taskcluster channel that @gregtatum linked to is usually pretty keen to help others get the core Taskcluster services working, but I'm not sure how much guidance they'll be able to offer on Translations-specific things, nor can I commit to helping with this.

marco-c commented 9 months ago

Another option that we have discussed for the future is to build a feature in Taskgraph to generate a Snakemake definition in addition to a Taskcluster one. We are not sure if/when we'll be able to build it though.

AmitMY commented 9 months ago

Thanks @bhearsum - I guess since I don't really have permissions on Mozilla's cluster, my only course of action is to set up a new instance.

@marco-c that would be swell! I think that would allow for much easier experimentation for researchers. Until now, I was running it in a Docker container on a single 4-GPU machine, and it worked fine, except that the translation performance was poor. Now that many bugs should be fixed, I wanted to try again, but the Snakemake definitions are out of date.