pranaydeeps / Ancient-Greek-BERT

Pre-trained BERT models for Ancient and Medieval Greek, and associated code for the LaTeCH 2021 paper "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"
GNU General Public License v3.0

Ancient Greek BERT

Note: The Morphological Analysis Tagger has issues loading on some machines and gives incorrect outputs due to an issue with the FLAIR Toolkit. If you run into this problem, please open an issue and we can try to help!

The first and only available Ancient Greek sub-word BERT model!

Achieves state-of-the-art results on Part-of-Speech Tagging and Morphological Analysis after fine-tuning.

Pre-trained weights are made available for a standard 12-layer, 768-dimensional BERT-base model.

You can also use the model directly from the HuggingFace Model Hub under the identifier pranaydeeps/Ancient-Greek-BERT.

For details, please refer to our paper, "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek", in Proceedings of The 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2021).

How to use

Requirements:

pip install transformers
pip install flair
# unicodedata is part of the Python standard library and needs no installation

The model can be loaded directly from the HuggingFace Model Hub with:

from transformers import AutoTokenizer, AutoModel
tokeniser = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")  
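
As a rough illustration of what the loaded model returns, the sketch below extracts one contextual vector per sub-word token. The helper name `embed_tokens` is hypothetical and not part of this repository. It assumes `tokeniser` and `model` are the objects loaded above, and that the input has already been de-accented and lower-cased as described under Training and Eval details.

```python
def embed_tokens(text, tokeniser, model):
    """Return one contextual vector per sub-word token of `text`.

    `tokeniser` and `model` are assumed to be the objects returned by
    AutoTokenizer.from_pretrained and AutoModel.from_pretrained.
    """
    # Switch off dropout so repeated calls give the same vectors.
    model.eval()
    # Truncate to the 512-token maximum sequence length used in training.
    inputs = tokeniser(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs)
    # last_hidden_state has shape (batch, tokens, hidden); take the single
    # batch item and detach it from the autograd graph.
    return outputs.last_hidden_state[0].detach()
```

Calling `embed_tokens("μηνιν αειδε θεα", tokeniser, model)` should give a tensor with one 768-dimensional row per sub-word token, including the special [CLS] and [SEP] tokens. For larger inputs, wrapping the forward pass in `torch.no_grad()` avoids building the autograd graph in the first place.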

Fine-tuning for POS/Morphological Analysis

from flair.models import SequenceTagger
tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

Training data

The model was initialised from AUEB NLP Group's Greek BERT and subsequently trained on monolingual data from the First1KGreek Project, Perseus Digital Library, PROIEL Treebank and Gorman's Treebank.

Training and Eval details

Input text is de-accented and lower-cased, following the standard preprocessing for Greek suggested for AUEB NLP Group's Greek BERT. The model was trained on 4 NVIDIA Tesla V100 16GB GPUs for 80 epochs with a maximum sequence length of 512, and reaches a perplexity of 4.8 on the held-out test set. When fine-tuned for PoS Tagging and Morphological Analysis, it also gives state-of-the-art results on all 3 treebanks, averaging >90% accuracy. Please consult our paper or contact me with further questions!
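
The de-accenting and lower-casing step can be sketched with Python's standard `unicodedata` module. This is a minimal sketch, assuming accents are stripped via NFD decomposition; the function name `strip_accents_and_lowercase` is illustrative, and the exact preprocessing used for training should be checked against the paper.

```python
import unicodedata


def strip_accents_and_lowercase(text: str) -> str:
    """Remove combining diacritics and lower-case the text."""
    # NFD splits each precomposed character into a base letter followed by
    # combining marks (accents, breathings, iota subscript).
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers those combining marks.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()
```

For example, `strip_accents_and_lowercase("Ἐν ἀρχῇ ἦν ὁ λόγος")` yields `εν αρχη ην ο λογος`. Applying the same function to any text before tokenisation keeps inference inputs consistent with the training data.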

Cite

If you end up using Ancient-Greek-BERT in your research, please cite the paper:

@inproceedings{ancient-greek-bert,
author = {Singh, Pranaydeep and Rutten, Gorik and Lefever, Els},
title = {A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek},
year = {2021},
booktitle = {The 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2021)}
}