Thanks :blush:
NER in general is quite a capitalization-dependent task. I have trained NER on unpunctuated, uncased corpora with uncased base models before, and the outcome was just poor. Even the biggest LLMs such as BloomZ perform only so-so on uncased text.
The remedy is to use another pipeline component that restores capitalization and punctuation before NER runs. I have such work and will publish it a bit after summer :wink:
Here is a quick and dirty solution with the truecase package:
First, train a caser for truecasing:
import wikipedia
import nltk
from nltk.tokenize import sent_tokenize
from truecase import Trainer
# set language as turkish
wikipedia.set_lang('tr')
# select suitable documents for your future entities
doc_titles = ["Erzurum", "Okul", "Eğitim", "Tatil", "Resmi Tatil"]
# get docs
documents = [wikipedia.page(title).content for title in doc_titles]
# prepare the corpus by splitting documents into sentences, then into tokens
nltk.download('punkt')
corpus = []
for doc in documents:
    sents = sent_tokenize(text=doc, language='turkish')
    corpus += [sent.split() for sent in sents]
# train your caser
trainer = Trainer()
trainer.train(corpus)
trainer.save_to_file("tr.dist")
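To give a feel for what the training step collects, here is a toy unigram version in plain Python. This is only a sketch of the idea, not how the truecase package is implemented (as far as I know it relies on richer n-gram statistics), and the names `train_unigram_caser` and `apply_unigram_caser` are made up for illustration. It simply remembers the most frequent surface form of each lowercased token:

```python
from collections import Counter, defaultdict


def train_unigram_caser(corpus):
    """Count surface forms per lowercased token.

    corpus: list of token lists, the same shape passed to Trainer.train above.
    """
    counts = defaultdict(Counter)
    for sent in corpus:
        # Skip the sentence-initial token: its capital letter comes from
        # position, so it says little about the word itself.
        for tok in sent[1:]:
            counts[tok.lower()][tok] += 1
    return counts


def apply_unigram_caser(counts, sentence):
    """Replace each token with its most frequent seen casing, if any."""
    out = []
    for tok in sentence.split():
        seen = counts.get(tok.lower())
        out.append(seen.most_common(1)[0][0] if seen else tok)
    return " ".join(out)


# Tiny hand-written corpus, for illustration only.
toy_corpus = [
    "Kışın en soğuk şehir Erzurum olur".split(),
    "Bugün Erzurum ve Kars karlı".split(),
    "Yarın okullar tatil olacak".split(),
]
counts = train_unigram_caser(toy_corpus)
print(apply_unigram_caser(counts, "erzurum ve kars karlı"))
```

Note that Python's `str.lower()` does not apply the Turkish dotted/dotless I rules, so a real caser for Turkish would need its own lowercasing for words containing `I` or `İ`.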
Then, fix the case of the sentence before passing it to the model:
import re
import spacy
from truecase import TrueCaser
# strip the '-de', '-da', '-te', '-ta' suffixes to get cleaner entities
def remove_postfix(sentence):
    ptn = re.compile(r"\b(\w+)(de|da|te|ta)\b", flags=re.IGNORECASE | re.UNICODE)
    return ptn.sub(r"\g<1>", sentence)
# import turkish model
nlp = spacy.load("tr_core_news_md")
# import trained caser
caser = TrueCaser("tr.dist")
uncased = "erzurumda okullar tatil mi"
print(f"uncased: {uncased}")
cased = caser.get_true_case(uncased)
print(f"cased: {cased}")
clean_cased = remove_postfix(cased)
print(f"clean_cased: {clean_cased}")
# extract entities
entities = nlp(clean_cased).ents
print(f"entities: {entities}")
Output:
uncased: erzurumda okullar tatil mi
cased: Erzurumda okullar tatil Mi
clean_cased: Erzurum okullar tatil Mi
entities: (Erzurum,)
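One caveat with the suffix regex above: it strips `de/da/te/ta` from every word that ends that way, so ordinary words get damaged too (for example `moda` becomes `mo` and `haftasında` becomes `haftasın`). A slightly more careful sketch, assuming the caser has already capitalized the likely proper nouns, is to touch only tokens that start with an uppercase letter and to swallow an optional apostrophe. The helper name `strip_locative` is hypothetical:

```python
import re

# Only match tokens that begin with an uppercase (Turkish) letter, optionally
# followed by an apostrophe before the suffix, e.g. "Erzurumda" or "Erzurum'da".
_LOCATIVE = re.compile(r"\b([A-ZÇĞİÖŞÜ]\w+?)['’]?(?:de|da|te|ta)\b")


def strip_locative(sentence: str) -> str:
    """Drop a trailing -de/-da/-te/-ta from capitalized tokens only."""
    return _LOCATIVE.sub(r"\1", sentence)


print(strip_locative("Erzurumda okullar tatil Mi"))
print(strip_locative("Erzurum'da kar var"))
print(strip_locative("moda haftasında"))
```

This is still a heuristic: a proper noun whose stem happens to end in one of these syllables (say `Kanada`) would be cut as well, so a morphological analyzer would be the safer choice for anything beyond a quick experiment.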
Hi! First of all, thank you for this amazing repo :) While spending time with the medium model, I noticed this:
Is it possible to extract entities without depending on case and punctuation?
I am a newbie to NLP btw, sorry if this is an unrelated question.