hplt-project / OpusPocus

Marian machine translation training pipeline for thousands of models
2 stars 0 forks source link

DecontaminateStep does not properly process parallel corpora #33

Closed varisd closed 2 months ago

varisd commented 3 months ago

DecontaminateStep currently processes each corpus individually as a monolingual corpus which can result in removing varying number of lines across a dataset languages.

varisd commented 2 months ago

Solved in PR #37