Closed by KhanhTungTran 6 months ago
Hi there, thanks for checking out the paper and the nice words!
@laurenCassidy do you still have the corpus/scripts used in the evaluation?
It may also be worth looking at this script to inspect various LMs on masked token prediction.
Hi @KhanhTungTran, here is the corpus used for the cloze test: cloze.csv
@jbrry should I add the cloze test script to the repository or share it another way? GitHub doesn't allow me to attach a .py file to a comment.
Thank you for the corpus file!
Thanks @laurenCassidy, if you could add it to the scripts directory that would be great!
I have added the cloze.py script to the scripts directory
Hi guys,
Just want to let you know that we have published our articles on building Irish-based LLMs, where your Cloze Test set was an important resource for evaluation.
As our models are causal language models, we had to adapt the evaluation method to fit causal LMs instead of masked LMs. However, our best accuracy was 0.80, approximately that of your gaBERT model, indicating that more can be done to develop a proficient Irish language model.
[1] Tran, K. T., O'Sullivan, B., & Nguyen, H. D. UCCIX: Irish-eXcellence Large Language Model (2024). In ECAI2024 - Demo Track.
[2] Tran, K. T., O'Sullivan, B., & Nguyen, H. D. Irish-based Large Language Model with extreme low-resource settings in machine translation (2024). In LoResMT2024, ACL.
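For anyone reading along: the comment above does not say how the cloze evaluation was adapted for causal LMs, so the following is only a rough sketch of one common approach, not the method used in [1] or [2]. The idea is to fill the blank with each candidate word, score the whole sentence under the model, and pick the highest-scoring candidate. The `[MASK]` placeholder, the `(template, candidates, gold)` item layout, and the `toy_log_prob` scorer are all hypothetical; the real cloze.csv layout may differ, and in practice the scorer would return the summed token log-probabilities from a causal LM.

```python
from typing import Callable, Iterable, Sequence, Tuple

MASK = "[MASK]"  # hypothetical placeholder; the real cloze.csv format may differ

Item = Tuple[str, Sequence[str], str]  # (template, candidates, gold answer)


def pick_candidate(template: str,
                   candidates: Sequence[str],
                   sentence_log_prob: Callable[[str], float]) -> str:
    """Return the candidate whose filled-in sentence scores highest."""
    return max(
        candidates,
        key=lambda cand: sentence_log_prob(template.replace(MASK, cand, 1)),
    )


def cloze_accuracy(items: Iterable[Item],
                   sentence_log_prob: Callable[[str], float]) -> float:
    """Fraction of items where the top-scoring candidate is the gold answer."""
    items = list(items)
    correct = sum(
        pick_candidate(template, candidates, sentence_log_prob) == gold
        for template, candidates, gold in items
    )
    return correct / len(items)


def toy_log_prob(sentence: str) -> float:
    """Stand-in scorer with made-up values so the sketch runs without a model.

    A real scorer would sum the log-probabilities a causal LM assigns to
    each token of the sentence.
    """
    made_up_scores = {"Tá an cat ar an mbord": -5.0}
    return made_up_scores.get(sentence, -20.0)


if __name__ == "__main__":
    items = [("Tá an cat [MASK] an mbord", ["ag", "ar"], "ar")]
    print(cloze_accuracy(items, toy_log_prob))
```

Scoring the full sentence rather than only the text up to the blank lets the words after the blank influence the choice, which a left-to-right model would otherwise never see.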
Hello,
Thank you for the great work!
I am a PhD student in AI at University College Cork, and I am interested in training large-scale language models for Irish.
I was wondering if you could release the Cloze Test set used in your paper? It would be a great resource for evaluating Irish language models.
Thank you.