pranaydeeps / Ancient-Greek-BERT

Pre-trained BERT models for Ancient and Medieval Greek, and associated code for the LaTeCH 2021 paper "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"
GNU General Public License v3.0

Morphological Analysis Examples #1

Open · PeterPirog opened this issue 2 years ago

PeterPirog commented 2 years ago

@pranaydeeps Is it possible to add a simple code example showing how to do morphological analysis of a short Ancient Greek sentence? Maybe some examples of:

Thank you very much for the model. I tried to train a smaller BERT model myself, but I didn't have enough GPU resources. I would like to use your model for New Testament analysis.

pranaydeeps commented 2 years ago

I think I have some Jupyter notebooks which I used for my experiments and analysis. I will try to clean the code up a bit and upload it whenever I can! Meanwhile, if you have an urgent need for it, you can take a look at the FLAIR toolkit documentation, since the Morphological Analysis model is trained using the FLAIR toolkit.
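
For reference, a minimal sketch of what training such a tagger with FLAIR might look like, assuming a hypothetical column-format corpus in data/ (one token and its morphological tag per line) and a recent FLAIR release (older versions use make_tag_dictionary instead of make_label_dictionary). The file names and hyperparameters below are placeholders, not the exact setup from the paper:

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# hypothetical corpus layout: "token<TAB>tag" per line, blank line between sentences
corpus = ColumnCorpus("data/", {0: "text", 1: "pos"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="pos")

# contextual word embeddings from the pre-trained Ancient Greek BERT
embeddings = TransformerWordEmbeddings("pranaydeeps/Ancient-Greek-BERT", fine_tune=True)

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="pos")

trainer = ModelTrainer(tagger, corpus)
trainer.train("SuperPeitho-FLAIR-v2", max_epochs=10)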

PeterPirog commented 2 years ago

@pranaydeeps Hi, thank you for the answer. This isn't urgent, so I will wait patiently and try to do something myself. For now I'm using this code:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

# Tokenize the sentences one by one:
sent = [
    " Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης", # Mt 27.45 sentence 1
    " γενομένης δὲ ὥρας ἕκτης σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης", # Mk 15.33 sentence 2
    " Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν", # Mt 27.39 sentence 3
    " ἦν δὲ ὡσεὶ ὥρα ἕκτη Καὶ σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης" # Lk23.44 sentence 4
]
# sentence 1 to sentence 2 = 0.94861794
# sentence 1 to sentence 3 = 0.6118592
# sentence 1 to sentence 4 = 0.9064161

# initialize dictionary: stores tokenized sentences
token = {'input_ids': [], 'attention_mask': []}
for sentence in sent:
    # encode each sentence, append to dictionary
    new_token = tokenizer.encode_plus(sentence, max_length=128,
                                      truncation=True, padding='max_length',
                                      return_tensors='pt')
    token['input_ids'].append(new_token['input_ids'][0])
    token['attention_mask'].append(new_token['attention_mask'][0])
# reformat list of tensors to single tensor
token['input_ids'] = torch.stack(token['input_ids'])
token['attention_mask'] = torch.stack(token['attention_mask'])

# Process the tokens through the model (no gradients needed for inference):
with torch.no_grad():
    output = model(**token)
print(output.keys())

# The dense vector representations of the text are contained in the output's 'last_hidden_state' tensor
embeddings = output.last_hidden_state
print(embeddings)

# To mean-pool the token embeddings, we first resize the attention_mask
# tensor to match the shape of the embeddings:
att_mask = token['attention_mask']
print(att_mask.shape)    # torch.Size([4, 128])

# Expand the mask to the embedding size so padded positions can be zeroed out:
mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
print(mask.shape)    # torch.Size([4, 128, 768])

mask_embeddings = embeddings * mask
print(mask_embeddings.shape)    # torch.Size([4, 128, 768])

# Then we sum the remaining embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
print(summed.shape)    # torch.Size([4, 768])

# Then count the positions that receive attention (clamped to avoid division by zero):
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
print(summed_mask.shape)    # torch.Size([4, 768])

# Divide the summed embeddings by the token counts to get mean-pooled sentence vectors:
mean_pooled = summed / summed_mask
print(mean_pooled)

from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between sentence 1 and sentences 2-4.
# First convert from PyTorch tensor to NumPy array:
mean_pooled = mean_pooled.detach().numpy()
similarity = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:])

print(similarity)

but maybe there is a better way to do it. As far as I know, the biggest corpus of Ancient Greek is the TLG (http://stephanus.tlg.uci.edu/), but unfortunately even the untagged texts aren't open source.
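
For reference, here is a more concise sketch of the same mean-pooling computation, assuming the same model and the sentence list sent from above; the helper name mean_pool is made up here, and the tokenizer's batch interface replaces the manual encoding loop:

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

def mean_pool(last_hidden_state, attention_mask):
    # expand the mask to the embedding size and average only over real tokens
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# the tokenizer accepts the whole list of sentences and pads the batch itself
tokens = tokenizer(sent, padding=True, truncation=True,
                   max_length=128, return_tensors='pt')
with torch.no_grad():
    output = model(**tokens)

embeddings = mean_pool(output.last_hidden_state, tokens['attention_mask'])
similarity = cosine_similarity(embeddings[:1].numpy(), embeddings[1:].numpy())
print(similarity)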

pranaydeeps commented 2 years ago

@PeterPirog apologies for the delayed reply. You can use something like the following to get Morphological Analysis outputs from the pre-trained model, if your text is saved line by line in the file "input_text_clean.txt":

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

with open("../input_text_clean.txt", "r") as testfile:
    test_list = testfile.readlines()

outfile = open("../morph_analysis_outputs.txt", "w")
for testitem in test_list:
    sentence = Sentence(testitem)
    tagger.predict(sentence)
    # each span pairs a token sequence with its predicted morphological tag
    outputs = sentence.get_spans('pos')
    for output in outputs:
        outfile.write(str(output) + "\n")
    outfile.write("\n")
outfile.close()
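
As a quick sanity check of the output format (a sketch; the exact span string depends on the FLAIR version), you can also tag a single sentence and print the spans directly:

# tag one sentence and print each recognised span with its predicted tag
sentence = Sentence("Ἐν ἀρχῇ ἦν ὁ λόγος")
tagger.predict(sentence)
for span in sentence.get_spans('pos'):
    print(span)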