jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Increase weight of clean corpora such as NCI #53

Open jowagner opened 3 years ago

jowagner commented 3 years ago

When combining NCI with Common Crawl, ParaCrawl, OSCAR and other noisy corpora, it may be beneficial to give more weight to the clean corpora, e.g. by concatenating multiple copies of them into the training data.
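A minimal sketch of the "multiple copies" idea, not code from this repository: the function names `upsample` and `share`, the toy sentences, and the copy count of 3 are all hypothetical and only illustrate how repeating a clean corpus changes its share of the combined data.

```python
from typing import Dict, Iterator, List


def upsample(corpora: Dict[str, List[str]], copies: Dict[str, int]) -> Iterator[str]:
    """Yield every line of every corpus, repeating each corpus copies[name] times.

    Corpora without an entry in `copies` are emitted once.
    """
    for name, lines in corpora.items():
        n = copies.get(name, 1)
        if n < 1:
            raise ValueError(f"copy count for {name!r} must be >= 1, got {n}")
        for _ in range(n):
            yield from lines


def share(corpora: Dict[str, List[str]], copies: Dict[str, int], name: str) -> float:
    """Fraction of lines in the combined data that come from corpus `name`."""
    total = sum(len(lines) * copies.get(n, 1) for n, lines in corpora.items())
    return len(corpora[name]) * copies.get(name, 1) / total


# Toy stand-ins for the real corpora (hypothetical data, not actual NCI/OSCAR text).
corpora = {
    "nci": ["clean sentence 1", "clean sentence 2"],
    "oscar": ["noisy 1", "noisy 2", "noisy 3", "noisy 4", "noisy 5", "noisy 6"],
}
copies = {"nci": 3}  # hypothetical weight: three copies of the clean corpus

combined = list(upsample(corpora, copies))
print(len(combined))                  # 2 * 3 + 6 = 12 lines
print(share(corpora, copies, "nci"))  # 6 / 12 = 0.5
```

In this toy setup the clean corpus rises from a quarter of the lines to half of them. A suitable copy count for the real corpora would have to be chosen empirically, and any deduplication step in the pipeline would need to run before the copies are added, or it would undo the weighting.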

jbrry commented 3 years ago

Yes, sounds like a good idea. I can rerun the best-performing gaBERT model with this weighting (and the new segmentation).