Tran et al. (2024), "UCCIX: Irish-eXcellence Large Language Model", cite the paper below as the source for "Corpora Irish", reported as 37.1M characters before and 11.1M characters after preprocessing. However, the cited paper does not mention Irish, although Scottish Gaelic is listed as one of the included languages.
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 759–765, Istanbul, Turkey. European Language Resources Association (ELRA).
Alternative URL: https://aclanthology.org/L12-1154/