Open eu9ene opened 10 months ago
Helping tune cleaning rules. We have only just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.
I think something like https://github.com/hplt-project/OpusCleaner/issues/148#issuecomment-1905590936 would be ideal here.
People keep asking how to help add another language.
A good first step would be helping to research datasets. To estimate the feasibility of training, we need statistics on how much data is available, including monolingual datasets.
Contributing datasets that are not on OPUS or mtdata. A good example is when folks provided data for Catalan, and now @gregtatum is experimenting with it.
Helping tune cleaning rules. We have only just started looking into OpusCleaner ourselves. In the future we could provide a guide on how to run the UI, tune rules for a language pair, and contribute configs to the repo.
For those looking to train a language pair themselves, helping to maintain the Snakemake workflow would be handy.
We might have simple issues in the training pipeline that contributors could take care of.
We can set up a workflow on GitHub by creating an issue for each language (ideally from a template), adding all the stats to it, and discussing everything related to that language there.
We should add a doc with clear guidelines on all this.
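To make the per-language issue idea more concrete, here is a minimal sketch of how a contributor might summarize dataset statistics as a Markdown table to paste into such an issue. Everything in it is an assumption for illustration: the `DatasetStat` structure, the `parallel`/`mono` labels, and the example entries are hypothetical, and the numbers are placeholders rather than real dataset sizes.

```python
from dataclasses import dataclass


@dataclass
class DatasetStat:
    name: str       # dataset identifier, e.g. as listed on OPUS or mtdata
    kind: str       # "parallel" or "mono" (hypothetical labels)
    sentences: int  # number of sentences or sentence pairs


def summarize(stats):
    """Return the total sentence count per dataset kind."""
    totals = {}
    for s in stats:
        totals[s.kind] = totals.get(s.kind, 0) + s.sentences
    return totals


def to_markdown(stats):
    """Render the stats as a Markdown table, largest datasets first."""
    lines = ["| Dataset | Kind | Sentences |", "| --- | --- | ---: |"]
    for s in sorted(stats, key=lambda s: s.sentences, reverse=True):
        lines.append(f"| {s.name} | {s.kind} | {s.sentences:,} |")
    # Append one total row per kind so feasibility is easy to judge at a glance.
    for kind, total in sorted(summarize(stats).items()):
        lines.append(f"| **total** | {kind} | {total:,} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only, not real dataset sizes.
    example = [
        DatasetStat("example-corpus-a", "parallel", 1_200_000),
        DatasetStat("example-corpus-b", "parallel", 300_000),
        DatasetStat("example-mono-target", "mono", 5_000_000),
    ]
    print(to_markdown(example))
```

The counts themselves would still need to be gathered by hand or from the dataset sources; the sketch only covers formatting them consistently so that issues for different languages are easy to compare.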