jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Include Irish subset of CulturaX #127

Open jowagner opened 4 months ago

Extract the Irish subset of CulturaX (Nguyen et al., 2024, "CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages").

Irish (ga) is not mentioned in the paper, but it is listed on https://huggingface.co/datasets/uonlp/CulturaX with 377M tokens.
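
A minimal sketch of how the extraction might look, assuming the Hugging Face `datasets` library, that the Irish subset is exposed as a config named `ga`, and that each record carries its document in a `text` field (none of this has been checked against this repository's scripts). The helper names `write_texts` and `stream_irish_subset` are hypothetical. The dataset may also require accepting its terms on the Hub and authenticating before download.

```python
from typing import Iterable, Mapping, TextIO


def write_texts(records: Iterable[Mapping], out: TextIO, text_field: str = "text") -> int:
    """Write each non-empty document to `out`, separated by a blank line.

    Returns the number of documents written. `text_field` is an assumption
    about the record layout, not something confirmed by the issue.
    """
    count = 0
    for record in records:
        text = (record.get(text_field) or "").strip()
        if not text:
            continue  # skip records with missing or empty text
        out.write(text + "\n\n")
        count += 1
    return count


def stream_irish_subset():
    """Return a streaming iterator over the (assumed) `ga` config of CulturaX."""
    # Imported lazily so write_texts can be used without the library installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole subset before iterating.
    return load_dataset("uonlp/CulturaX", "ga", split="train", streaming=True)
```

To use it, open an output file in UTF-8 mode and call `write_texts(stream_irish_subset(), f)`. Streaming keeps memory use flat, which seems a reasonable choice for a subset of this size, though a full download may be preferable if the data is reprocessed often.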