Right now the importer splits on word boundaries and limits the size of the monolingual data to fewer than 100 "words". This needs to change to support a different segmentation strategy for CJK languages, perhaps just a byte limit.
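To illustrate the problem, here is a minimal sketch comparing a whitespace word limit with a byte limit. It assumes the limit is applied per line and that "words" means whitespace-separated tokens. The function names and the `MAX_BYTES` value are hypothetical, not taken from the importer.

```python
MAX_WORDS = 100  # the word limit described above
MAX_BYTES = 500  # hypothetical byte budget, chosen only for this example


def under_word_limit(line: str, max_words: int = MAX_WORDS) -> bool:
    """Keep a line only if it has fewer than max_words whitespace tokens."""
    return len(line.split()) < max_words


def under_byte_limit(line: str, max_bytes: int = MAX_BYTES) -> bool:
    """Keep a line only if its UTF-8 encoding is shorter than max_bytes."""
    return len(line.encode("utf-8")) < max_bytes


# 150 whitespace-separated tokens: the word limit rejects this line.
english = "word " * 150

# A very long Chinese line with no spaces: it counts as a single "word",
# so the word limit lets it through regardless of its length.
chinese = "这是一个没有空格的句子" * 150

for name, line in [("english", english), ("chinese", chinese)]:
    print(name, under_word_limit(line), under_byte_limit(line))
```

A byte limit does not depend on segmentation, but the same budget allows fewer characters in CJK text than in ASCII text, because most CJK characters take three bytes in UTF-8.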
This will be solved by removing this filtering from the importer, since the data is filtered in the monolingual cleaning step anyway. Ideally, we would use OpusCleaner there, but it doesn't support monolingual corpora yet.