PeterPirog opened 2 years ago
I think I have some Jupyter notebooks which I used for my experiments and analysis. I will try to clean the code a bit and upload it whenever I can! Meanwhile, if you have an urgent need for it, you can take a look at the FLAIR toolkit documentation, since the Morphological Analysis model is trained using the FLAIR toolkit.
@pranaydeeps Hi, thank you for the answer. This isn't urgent, so I will wait patiently and try to do something myself. For now I use this code:
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
### Tokenize the sentences:
sent = [
" Ἀπὸ δὲ ἕκτης ὥρας σκότος ἐγένετο ἐπὶ πᾶσαν τὴν γῆν ἕως ὥρας ἐννάτης", # Mt 27.45 sentence 1
" γενομένης δὲ ὥρας ἕκτης σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης", # Mk 15.33 sentence 2
" Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν", # Mt 27.39 sentence 3
" ἦν δὲ ὡσεὶ ὥρα ἕκτη Καὶ σκότος ἐγένετο ἐφ᾽ ὅλην τὴν γῆν ἕως ὥρας ἐννάτης" # Lk23.44 sentence 4
]
# sentence 1 to sentence 2 = 0.94861794
# sentence 1 to sentence 3 = 0.6118592
# sentence 1 to sentence 4 = 0.9064161
# initialize dictionary: stores tokenized sentences
token = {'input_ids': [], 'attention_mask': []}
for sentence in sent:
    # encode each sentence, append to dictionary
    new_token = tokenizer.encode_plus(sentence, max_length=128,
                                      truncation=True, padding='max_length',
                                      return_tensors='pt')
    token['input_ids'].append(new_token['input_ids'][0])
    token['attention_mask'].append(new_token['attention_mask'][0])
# reformat list of tensors to single tensor
token['input_ids'] = torch.stack(token['input_ids'])
token['attention_mask'] = torch.stack(token['attention_mask'])
# Process tokens through model:
output = model(**token)
print(output.keys())
# The dense vector representations of text are contained within the outputs 'last_hidden_state' tensor
embeddings = output.last_hidden_state
print(embeddings)
# To perform mean pooling, we first resize our attention_mask tensor:
att_mask = token['attention_mask']
att_mask.shape
# output: torch.Size([4, 128])
mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
# Output: torch.Size([4, 128, 768])
mask_embeddings = embeddings * mask
mask_embeddings.shape
# Output: torch.Size([4, 128, 768])
# Then we sum the masked embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
summed.shape
# Output: torch.Size([4, 768])
# Then sum the mask to count the tokens that receive attention in each position:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape
# Output: torch.Size([4, 768])
mean_pooled = summed / summed_mask
print(mean_pooled)
from sklearn.metrics.pairwise import cosine_similarity
# Let's calculate cosine similarity for sentence 0:
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()
# calculate
similarity = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)
print(similarity)
but maybe there is a better way to do it (a condensed version of the mean pooling is sketched below). As far as I know, the biggest Ancient Greek language corpus is the TLG (http://stephanus.tlg.uci.edu/), but unfortunately even the untagged texts aren't open source.
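A minimal sketch of the same mean pooling wrapped in a helper function (the name mean_pool is just illustrative; it reproduces the step-by-step computation above):

import torch

def mean_pool(last_hidden_state, attention_mask):
    # expand the mask to the embedding size and zero out padded positions
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# equivalent to the step-by-step version:
# mean_pooled = mean_pool(output.last_hidden_state, token['attention_mask'])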
@PeterPirog apologies for the delayed reply. You can use something like the following to get Morphological Analysis outputs from the pre-trained model, if your text is saved line by line in the file "input_text_clean.txt":
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

# read the input text, one sentence per line
with open("../input_text_clean.txt", "r") as testfile:
    test_list = testfile.readlines()

outfile = open("../morph_analysis_outputs.txt", "w")
for testitem in test_list:
    sentence = Sentence(testitem)
    tagger.predict(sentence)
    # write one tagged span per line, with a blank line between sentences
    outputs = sentence.get_spans('pos')
    for output in outputs:
        outfile.write(str(output) + "\n")
    outfile.write("\n")
outfile.close()
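For a single short sentence, the same tagger can be called directly without going through a file; a minimal sketch following the loop above (the example sentence is taken from earlier in the thread):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the released FLAIR checkpoint (same path as above)
tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')

# tag one short Ancient Greek sentence
sentence = Sentence("Οἱ δὲ παραπορευόμενοι ἐβλασφήμουν αὐτὸν κινοῦντες τὰς κεφαλὰς αὐτῶν")
tagger.predict(sentence)

# print the predicted morphological tags, one span per line
for span in sentence.get_spans('pos'):
    print(span)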
@pranaydeeps Is it possible to add a simple code example showing how to do morphological analysis of a short Ancient Greek sentence? Maybe some examples of:
Thank you very much for the model. I tried to train a smaller BERT model but I didn't have enough GPU resources. I would like to use your model for New Testament analysis.