jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Include Irish subset of Leipzig Corpora Collection #126

Open jowagner opened 4 months ago

jowagner commented 4 months ago

Tran et al. (2024), "UCCIX: Irish-eXcellence Large Language Model", cite the paper below as the source of "Corpora Irish", reporting 37.1M characters before and 11.1M characters after preprocessing. However, the cited paper does not mention Irish. Scottish Gaelic is listed as one of the included languages.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 759–765, Istanbul, Turkey. European Language Resources Association (ELRA).

Alternative URL: https://aclanthology.org/L12-1154/