hplt-project / OpusPocus

Marian machine translation training pipeline for thousands of models

Corpus sharding simplification #43

Closed: varisd closed this 3 months ago

varisd commented 3 months ago

I simplified the sharding support in CorpusStep and removed the implicit sharding of the inputs. Now only the output is generated in shards, which are then merged into the final file.

I also added unit tests for corpus_step and the sharding-related utils.
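
For readers unfamiliar with the pattern, here is a minimal sketch of the "write sharded output, then merge into the final file" flow described above. It is not the OpusPocus implementation: the function names `write_shards` and `merge_shards`, the `output.NNNN.txt` naming scheme, and the `shard_size` parameter are all hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_shards(lines: Iterable[str], out_dir: Path, shard_size: int) -> List[Path]:
    """Write lines into numbered shard files (hypothetical helper).

    Each shard holds at most `shard_size` lines. Returns the shard paths
    in the order they were written.
    """
    if shard_size < 1:
        raise ValueError("shard_size must be a positive integer")
    shard_paths: List[Path] = []
    handle = None
    try:
        for i, line in enumerate(lines):
            # Start a new shard every `shard_size` lines.
            if i % shard_size == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"output.{len(shard_paths):04d}.txt"
                shard_paths.append(path)
                handle = path.open("w", encoding="utf-8")
            handle.write(line.rstrip("\n") + "\n")
    finally:
        if handle is not None:
            handle.close()
    return shard_paths


def merge_shards(shard_paths: List[Path], final_path: Path) -> None:
    """Concatenate the shards, in order, into a single final file."""
    with final_path.open("w", encoding="utf-8") as out:
        for path in shard_paths:
            with path.open("r", encoding="utf-8") as shard:
                for line in shard:
                    out.write(line)


if __name__ == "__main__":
    tmp_dir = Path(tempfile.mkdtemp())
    shards = write_shards(["a", "b", "c", "d", "e"], tmp_dir, shard_size=2)
    merge_shards(shards, tmp_dir / "final.txt")
    print((tmp_dir / "final.txt").read_text(encoding="utf-8"), end="")
```

A unit test for this kind of utility would typically check that the number of shards matches the input size and that the merged file reproduces the original line order.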