jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Release of the Cloze Test set? #125

Closed KhanhTungTran closed 6 months ago

KhanhTungTran commented 8 months ago

Hello,

Thank you for the great work!

I am a PhD student in AI at University College Cork, and I am interested in training large-scale language models for Irish.

I am wondering if you could release the Cloze Test set used in your paper? It would be a great resource for evaluating Irish language models.

Thank you.

jbrry commented 8 months ago

Hi there, thanks for checking out the paper and the nice words!

@laurenCassidy do you still have the corpus/scripts used in the evaluation?

It may also be worth looking at this script to inspect various LMs on masked token prediction.

laurenCassidy commented 8 months ago

Hi @KhanhTungTran, Here is the corpus used for the cloze test: cloze.csv

laurenCassidy commented 8 months ago

@jbrry will I add the cloze test script to the repository or share it another way? GitHub doesn't allow me to attach a .py file to a comment.

KhanhTungTran commented 8 months ago

Thank you for the corpus file!

jbrry commented 8 months ago

will I add the cloze test script to the repository or share it another way? GitHub doesn't allow me to attach a .py file to a comment.

Thanks @laurenCassidy, if you could add it to the scripts directory that would be great!

laurenCassidy commented 8 months ago

I have added the cloze.py script to the scripts directory

KhanhTungTran commented 2 months ago

Hi guys,

Just want to let you know that we have published our articles on building Irish-based LLMs, where your Cloze Test set was an important resource for evaluation.

As our models are causal language models, we had to adapt the evaluation method to fit causal LMs instead of masked LMs. However, our best accuracy was 0.80, approximately that of your gaBERT model, which indicates that more can be done to develop a proficient Irish language model.

[1] Tran, K. T., O'Sullivan, B., & Nguyen, H. D. UCCIX: Irish-eXcellence Large Language Model (2024). In ECAI2024 - Demo Track.

[2] Tran, K. T., O'Sullivan, B., & Nguyen, H. D. Irish-based Large Language Model with extreme low-resource settings in machine translation (2024). In LoResMT2024, ACL.
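
For anyone reading this thread later: one common way to run a cloze-style test with a causal LM is to fill the blank with each candidate word, score every filled-in sentence with the model, and pick the highest-scoring one. The sketch below only illustrates that idea and is not necessarily the method used in the papers above or in cloze.py. The `[MASK]` placeholder, the candidate lists, the item layout, and the `toy_log_prob` scorer are all assumptions made for the example; a real run would replace the scorer with summed token log-probabilities from the causal LM.

```python
import math
from typing import Callable, Sequence, Tuple

MASK = "[MASK]"  # assumed placeholder; the actual cloze.csv layout may differ


def pick_candidate(
    template: str,
    candidates: Sequence[str],
    log_prob: Callable[[str], float],
) -> str:
    """Fill the blank with each candidate and return the best-scoring one."""
    if MASK not in template:
        raise ValueError(f"template has no {MASK} placeholder: {template!r}")
    # Score the whole filled-in sentence, since a causal LM cannot look
    # at the right-hand context when predicting a single position.
    return max(candidates, key=lambda c: log_prob(template.replace(MASK, c, 1)))


def cloze_accuracy(
    items: Sequence[Tuple[str, Sequence[str], str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (template, candidates, gold) items answered correctly."""
    correct = sum(
        pick_candidate(template, candidates, log_prob) == gold
        for template, candidates, gold in items
    )
    return correct / len(items)


def toy_log_prob(sentence: str) -> float:
    """Hypothetical stand-in scorer: an add-one smoothed unigram model.

    With a real causal LM, this would return the sum of the token
    log-probabilities the model assigns to the sentence.
    """
    counts = {"tá": 5, "an": 4, "madra": 3, "mór": 2}
    denom = sum(counts.values()) + len(counts) + 1
    return sum(
        math.log((counts.get(word, 0) + 1) / denom)
        for word in sentence.lower().split()
    )


# Hypothetical items: (sentence with a blank, candidate fillers, gold answer)
items = [("Tá an [MASK] mór", ["xyz", "madra"], "madra")]
accuracy = cloze_accuracy(items, toy_log_prob)
```

Summed log-probabilities favour candidates that split into fewer tokens, so dividing by the token count may be worth considering when candidates differ in length.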