turkish-nlp-suite / turkish-spacy-models

Repo for spaCy Turkish model development.
Creative Commons Attribution Share Alike 4.0 International

Case & punctuation independent NER detection #1

Closed alteraa closed 2 months ago

alteraa commented 1 year ago

Hi! First of all, thank you for this amazing repo :) While spending time with the medium model, I noticed this:

>>> import spacy
>>> nlp = spacy.load("tr_core_news_md")
>>> nlp("Erzurum'da okullar tatil mi?").ents
(Erzurum'da,)
>>> nlp("erzurumda okullar tatil mi").ents
()

Is it possible to extract entities without case and punctuation dependency?

I am a newbie at NLP btw, sorry if this is an unrelated question.

DuyguA commented 1 year ago

Thanks! :blush:

NER is, in general, quite capitalization dependent. I have trained NER on unpunctuated, uncased corpora with uncased base models before, and the outcome was just poor. Even the biggest LLMs, such as BloomZ, perform only so-so on uncased text.

The remedy is to use another pipe component that restores capitalization and punctuation first. I have such work in progress and will publish it a bit after summer :wink:
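In the meantime, the idea can be sketched as a plain preprocessing step in front of the pipeline. Since a spaCy component cannot rewrite a `Doc`'s text, the restoring has to happen before `nlp()` is called. The caser below is just a toy dictionary lookup standing in for a real truecasing model; all names here are hypothetical:

```python
# Toy sketch: restore casing before handing text to the NER model.
# A real implementation would swap in a trained truecaser for this
# hypothetical lookup table.

KNOWN_CASING = {"erzurumda": "Erzurumda", "erzurum": "Erzurum"}  # toy data

def toy_truecase(text: str) -> str:
    """Recapitalize known words; leave everything else untouched."""
    return " ".join(KNOWN_CASING.get(w.lower(), w) for w in text.split())

def ner_with_truecasing(nlp, text):
    """Preprocess, then run the spaCy pipeline on the fixed text."""
    return nlp(toy_truecase(text)).ents

print(toy_truecase("erzurumda okullar tatil mi"))
# -> Erzurumda okullar tatil mi
```

A real setup would plug a trained caser, such as the one trained in the next comment, into the same slot.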

atasoglu commented 1 year ago

Here is a quick and dirty solution with the truecase package:

First, train a caser for true casing:

import wikipedia
import nltk
from nltk.tokenize import sent_tokenize
from truecase import Trainer

# set language as turkish
wikipedia.set_lang('tr')

# select suitable documents for your future entities 
doc_titles = ["Erzurum", "Okul", "Eğitim", "Tatil", "Resmi Tatil"]

# get docs
documents = [wikipedia.page(title).content for title in doc_titles]

# prepare the corpus by tokenizing the documents
nltk.download('punkt')
corpus = []
for doc in documents:
    sents = sent_tokenize(text=doc, language='turkish')
    corpus += [sent.split() for sent in sents]

# train your caser 
trainer = Trainer()
trainer.train(corpus)
trainer.save_to_file("tr.dist")

Then, fix the sentence's casing before passing it to the model:

import re
import spacy
from truecase import TrueCaser

# strips the locative suffixes '-de', '-da', '-te', '-ta' for cleaner entities
def remove_postfix(sentence):
    ptn = re.compile(r"\b(\w+)(de|da|te|ta)\b", flags=re.IGNORECASE | re.UNICODE)
    return ptn.sub(r"\g<1>", sentence)

# import turkish model
nlp = spacy.load("tr_core_news_md")

# import trained caser
caser = TrueCaser("tr.dist")

uncased = "erzurumda okullar tatil mi"
print(f"uncased: {uncased}")

cased = caser.get_true_case(uncased)
print(f"cased: {cased}")

clean_cased = remove_postfix(cased)
print(f"clean_cased: {clean_cased}")

# extract entities
entities = nlp(clean_cased).ents
print(f"entities: {entities}")

Output:

uncased: erzurumda okullar tatil mi
cased: Erzurumda okullar tatil Mi
clean_cased: Erzurum okullar tatil Mi
entities: (Erzurum,)
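One caveat on the `remove_postfix` regex above: it is not anchored to anything, so it clips `de`/`da`/`te`/`ta` off the end of any matching word, not just place names in the locative case. A quick demonstration (no spaCy needed):

```python
import re

# Same pattern as in remove_postfix above.
ptn = re.compile(r"\b(\w+)(de|da|te|ta)\b", flags=re.IGNORECASE | re.UNICODE)

# Intended behaviour: strips the locative suffix from the place name.
print(ptn.sub(r"\g<1>", "Erzurumda okullar tatil mi"))
# -> Erzurum okullar tatil mi

# Side effect: ordinary words ending in the same letters are clipped too.
print(ptn.sub(r"\g<1>", "burada hava çok sıcak"))
# -> bura hava çok sıcak
```

A safer variant might strip the suffix only from tokens the model has already tagged as entities, or rely on the apostrophes restored by the punctuation component DuyguA mentioned.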