Closed bhearsum closed 2 months ago
It's not really a bug. We don't support monolingual datasets from OPUS; I think they only started adding them recently. Also, we have plenty of English data for back-translation from news-crawl, and for other languages, if the data is on OPUS as parallel data, we'll want to use it as train and not mono.
Looking at this again, the NLLB dataset can have language pairs that don't include English. For instance, for Catalan there are 21M en-ca sentences but 65M es-ca sentences. I could see using the Catalan side of the es-ca pair as monolingual data. I don't see anywhere that someone has built a monolingual dataset from NLLB.
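A minimal sketch of that idea, assuming the es-ca pairs are already loaded as (Spanish, Catalan) tuples. `target_side_mono` is a hypothetical helper, not part of the pipeline or of NLLB, and the deduplication step is an added assumption (the same Catalan sentence can be paired with several Spanish ones).

```python
from typing import Iterable, Iterator, Tuple


def target_side_mono(pairs: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield each distinct, non-empty target-side sentence once, in first-seen order."""
    seen = set()
    for _src, trg in pairs:
        sentence = trg.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


# Toy (es, ca) pairs; real data would be streamed from the parallel corpus files.
pairs = [
    ("Hola", "Hola"),
    ("Buenos días", "Bon dia"),
    ("Buen día", "Bon dia"),
]
print(list(target_side_mono(pairs)))  # ['Hola', 'Bon dia']
```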
@gregtatum has been trying to use an `opus` dataset as one of the `mono` datasets. For example, with this training config:

Note the `opus_MaCoCu/v2` in `mono-src`.

When run, we end up with:
This is because over in the dataset kind we name `opus` datasets with both the `src` and `trg` locale in their name, while the clean-mono kind is looking for something named with just `src` or `trg`.

The obvious thing to do is to always name all datasets with both `src` and `trg` locales... but the fact that we have some datasets that are monolingual makes this a non-starter (I think). (I gave this a quick try and ended up with some complaints about `news-crawl` instead, at least...)

My horrible hack to unstick this in the short term was:
(This is not remotely landable or good - I'm mainly putting it here for future reference.)
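To make the naming mismatch described above concrete, here is a minimal sketch. It is not the hack mentioned above and does not use the pipeline's real code: the label formats, the helpers `produced_label`, `mono_lookup_label` and `resolves`, the `en`/`ca` locales, and the `news.2021` dataset name are all hypothetical placeholders.

```python
# Assumed split between importers; only "opus" and "news-crawl" come from the issue.
PARALLEL_IMPORTERS = {"opus"}    # assumed to be named with both locales
MONO_IMPORTERS = {"news-crawl"}  # assumed to be named with a single locale


def produced_label(importer: str, name: str, src: str, trg: str, locale: str) -> str:
    """Hypothetical label the dataset kind gives a task."""
    if importer in PARALLEL_IMPORTERS:
        return f"dataset-{importer}-{name}-{src}-{trg}"
    return f"dataset-{importer}-{name}-{locale}"


def mono_lookup_label(importer: str, name: str, locale: str) -> str:
    """Hypothetical label the clean-mono kind searches for: one locale only."""
    return f"dataset-{importer}-{name}-{locale}"


def resolves(importer: str, name: str, src: str = "en", trg: str = "ca", locale: str = "en") -> bool:
    """True when the label clean-mono looks up matches the one the dataset kind produced."""
    produced = produced_label(importer, name, src, trg, locale)
    return produced == mono_lookup_label(importer, name, locale)


print(resolves("news-crawl", "news.2021"))  # True: single-locale names line up
print(resolves("opus", "MaCoCu/v2"))        # False: "...-en-ca" vs "...-en"
```

Under these assumptions, switching the lookup to always use both locales would just move the failure from `opus` to the single-locale importers, which is consistent with the `news-crawl` complaints mentioned above.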