mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Add community contribution guidelines #387

Open eu9ene opened 9 months ago

eu9ene commented 9 months ago

People keep asking how to help add another language.

  1. A good first step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how much data there is, including monolingual datasets.

  2. Contributing datasets that are not on OPUS or mtdata. A good example is when folks provided data for Catalan and now @gregtatum is experimenting with it.

  3. Helping tune cleaning rules. We just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.

  4. For those looking to train a language pair themselves, helping to maintain Snakemake would be handy.

  5. We might have simple issues in the training pipeline that contributors could take care of.

We can set up a workflow on GitHub by creating an issue for each language (ideally from a template), adding all the stats there, and discussing everything related to that language in the same place.
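As a rough illustration of the dataset statistics mentioned in step 1, here is a minimal sketch that counts lines and whitespace-separated tokens in a corpus file and renders the result as a table that could be pasted into a per-language issue. The function names and the table layout are assumptions made for this example, not part of the training pipeline, and whitespace tokenization is only a crude size estimate (it will undercount for languages written without spaces).

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed UTF-8 text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def corpus_stats(path):
    """Count non-empty lines and whitespace-separated tokens in one file."""
    lines = 0
    tokens = 0
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines so they do not inflate the count
            lines += 1
            tokens += len(line.split())
    return {"lines": lines, "tokens": tokens}


def format_stats_table(stats_by_name):
    """Render stats as a Markdown table (hypothetical layout for an issue)."""
    rows = ["| dataset | lines | tokens |", "| --- | --- | --- |"]
    for name, stats in sorted(stats_by_name.items()):
        rows.append(f"| {name} | {stats['lines']} | {stats['tokens']} |")
    return "\n".join(rows)
```

Running `corpus_stats` over each downloaded file and passing the results, keyed by dataset name, to `format_stats_table` would give one table per language that is easy to compare across sources.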

We should add a doc with clear guidelines on all this.

marco-c commented 9 months ago

> Helping tuning cleaning rules. We just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair and contribute configs to the repo.

I think something like https://github.com/hplt-project/OpusCleaner/issues/148#issuecomment-1905590936 would be ideal here.