Open jowagner opened 4 months ago
Extract Irish subset of Nguyen et al. (2024) CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Irish (`ga`) is not mentioned in the paper but is listed on https://huggingface.co/datasets/uonlp/CulturaX with 377M tokens.
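A possible starting point for the extraction, as a rough sketch only. It assumes the Hugging Face `datasets` library is installed, that the Irish subset is exposed under the config name `ga`, that each record carries a `text` field, and that access to the gated dataset has already been granted (for example after `huggingface-cli login`). The helper names are hypothetical, and a whitespace token count will not necessarily match the 377M figure on the dataset card, since the tokenisation used there may differ.

```python
from typing import Iterable, Iterator, Mapping


def iter_irish_texts(records: Iterable[Mapping[str, object]]) -> Iterator[str]:
    """Yield the non-empty `text` field of each record (field name is an assumption)."""
    for record in records:
        text = record.get("text")
        if isinstance(text, str) and text.strip():
            yield text


def count_whitespace_tokens(texts: Iterable[str]) -> int:
    """Rough size estimate: number of whitespace-separated tokens."""
    return sum(len(text.split()) for text in texts)


def load_irish_stream():
    """Open the Irish subset as a stream so nothing is downloaded up front."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    # Assumption: "ga" is the config name for Irish; the dataset is gated,
    # so this call needs an authenticated Hugging Face session.
    return load_dataset("uonlp/CulturaX", "ga", split="train", streaming=True)


if __name__ == "__main__":
    from itertools import islice

    # Peek at the first 1000 records only, to sanity-check the subset.
    sample = islice(load_irish_stream(), 1000)
    print(count_whitespace_tokens(iter_irish_texts(sample)))
```

Streaming keeps the first check cheap; a full extraction could iterate over the whole stream and write each text to disk.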