I don't think the src-trg-ratio filter is feasible right now, because it requires tokenization. We would have to move tokenization to a separate step and then adjust the cleaning step to work with the tokenized text instead of the original text. I filed https://github.com/mozilla/firefox-translations-training/issues/899
Oh, and I don't have opinions on the rules themselves. Copying from another source seems reasonable, but I didn't think through the rules and how they apply to CJK.
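To illustrate why a word-based length ratio is unreliable on untokenized CJK text, here is a minimal sketch. It is not OpusCleaner's implementation: the function names `whitespace_word_count` and `length_ratio` and the sample sentences are hypothetical, and the only assumption is that "words" are counted by splitting on whitespace.

```python
from typing import Callable


def whitespace_word_count(text: str) -> int:
    """Count words by splitting on whitespace (no real tokenization)."""
    return len(text.split())


def length_ratio(src: str, trg: str,
                 count: Callable[[str], int] = whitespace_word_count) -> float:
    """Return the larger length divided by the smaller one.

    Hypothetical helper for illustration only, not a real filter.
    """
    a, b = count(src), count(trg)
    if min(a, b) == 0:
        return float("inf")
    return max(a, b) / min(a, b)


en = "The cat sat on the mat."
zh = "貓坐在墊子上。"  # written without spaces between words

# Whitespace splitting sees the whole Chinese sentence as a single "word",
# so the word-based ratio is inflated for a perfectly valid pair.
word_ratio = length_ratio(en, zh)

# Counting characters instead of words avoids the tokenization dependency,
# although a sensible threshold would differ per language pair.
char_ratio = length_ratio(en, zh, count=len)

print(word_ratio, char_ratio)
```

With whitespace splitting, a valid pair like this would likely be rejected by any reasonable word-ratio threshold, which is why the word-based filters are disabled until tokenization is available as a separate step.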
Use custom OpusCleaner configs with disabled word-based filters.
The filters are copied from https://github.com/hplt-project/HPLT-MT-Models/blob/main/v1.0/data/en-zh_hant/raw/v2/HPLT-v1.1.en-zh_hant.filters.json.
closes #742