jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Investigate mC4 dataset #89

Open jbrry opened 2 years ago

jbrry commented 2 years ago

There seem to be 465,670 examples for ga (Irish) in Google's multilingual C4 (mC4) dataset. There are also 322,404 gd (Scottish Gaelic) examples. mC4 is a collection of multilingual text drawn from 71 CommonCrawl dumps.

There is some advice on downloading the dataset here. It's also integrated into Hugging Face's `datasets` library.
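As a rough sketch of what a first look through `datasets` might involve, the snippet below streams the Irish split and prints the start of a few documents. The dataset id `allenai/c4`, the use of the language code as the config name, and the `text` field are assumptions based on how mC4 is usually exposed on the Hub, so check the dataset card before relying on them. `stream_mc4` and `peek_texts` are hypothetical helper names, not part of this repository.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def stream_mc4(lang: str) -> Iterator[Dict[str, str]]:
    """Lazily iterate over mC4 examples for one language code, e.g. "ga"."""
    # Imported here so peek_texts can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: "allenai/c4" with a per-language config ("ga", "gd") is how
    # the Hub currently exposes mC4; the id may differ in older versions.
    # streaming=True avoids downloading the whole split up front.
    return iter(load_dataset("allenai/c4", lang, split="train", streaming=True))


def peek_texts(examples: Iterable[Dict[str, str]], n: int = 3, width: int = 80) -> List[str]:
    """Return the first `width` characters of "text" for the first `n` examples."""
    return [ex["text"][:width] for ex in islice(examples, n)]


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for snippet in peek_texts(stream_mc4("ga")):
        print(snippet)
```

Streaming keeps this cheap enough to eyeball quality before deciding whether a full download is worthwhile. Swapping `"ga"` for `"gd"` should give the Scottish Gaelic portion under the same assumptions.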

We already have CommonCrawl data, but it would be worth looking at this resource too.