tecoholic / ner-annotator

Named Entity Recognition (NER) Annotation tool for spaCy. Generates training data as a JSON which can be readily used.
https://tecoholic.github.io/ner-annotator/
MIT License

Discrepancy between how NER Annotator and spaCy are handling certain Unicode characters #119

elifbeyzatok00 commented 1 month ago

I wanted to display a JSON file, labeled with this tool, using spaCy displacy, but the entities do not render where they should.

I carefully label the text in the tool: [screenshot of the annotations in NER Annotator]

When I view it with spaCy displacy, irrelevant spans are labeled while the spans that should be labeled are not: [screenshot of the displacy output]

The code I used to view the labeled text with spaCy displacy:

import json
import spacy
from spacy import displacy

# Load the spaCy model (not strictly needed here, since displacy.render
# is called with manual=True and the entities come from the JSON file)
nlp = spacy.load("en_core_web_sm")

# Path to the JSON annotation file
file_path = "/content/annotations.json"

# Open the JSON file and load the data
with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

    if 'annotations' in data:
        for annotation in data['annotations']:
            if annotation is not None:
                text = annotation[0]  # The annotated text
                entities = [(ent[0], ent[1], ent[2]) for ent in annotation[1]['entities']]  # The entities

                # Prepare the data in the format displacy expects
                spacy_displacy_data = {
                    "text": text,
                    "ents": [{"start": start, "end": end, "label": label} for start, end, label in entities],
                    "title": None
                }

                # Render the entities with displacy
                displacy.render(spacy_displacy_data, style="ent", manual=True, jupyter=True)
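
For reference, the annotations.json file this script reads follows the structure NER Annotator exports; a minimal example (label name and offsets are illustrative):

{
  "classes": ["LABEL"],
  "annotations": [
    ["Some annotated text.", {"entities": [[5, 14, "LABEL"]]}]
  ]
}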
alvi-khan commented 1 month ago

Hello @elifbeyzatok00. Your code seems fine to me. I also tried using it with a small annotation file and it worked as expected.

Could you please provide a sample annotation file for which you're facing the issue so that I can investigate further?

elifbeyzatok00 commented 1 month ago

Thank you for your feedback. @alvi-khan

I've attached the sample file where I encountered the problem so you can investigate further.

sample.zip

alvi-khan commented 1 month ago

Thanks @elifbeyzatok00. I've managed to replicate the issue now.

It seems there's some discrepancy between how NER Annotator and spaCy are handling certain Unicode characters, specifically '🔗' in this case.

If it is acceptable for your use case, an easy workaround is to replace all instances of '🔗' with two spaces; since the tool counts the emoji as two characters, two spaces keep the exported offsets aligned. I've attached a copy of the annotation file you provided in which I have made this replacement.

sample_with_emoji_replaced.zip
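
For bigger files, the same replacement can be scripted. A minimal sketch in Python (file names are illustrative):

import json

with open("annotations.json", encoding="utf-8") as f:
    data = json.load(f)

# Replace the emoji in every annotated text; two spaces occupy the same
# two positions the tool assigned to '🔗', so the entity offsets stay valid.
for annotation in data["annotations"]:
    if annotation is not None:
        annotation[0] = annotation[0].replace("🔗", "  ")

with open("annotations_fixed.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)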

As you can see in the attached screenshot, it works correctly after this change.

[Screenshot 2024-08-09 003915: the entities render correctly after the replacement]

alvi-khan commented 1 month ago

For a slightly more technical analysis of why this is happening: our tokenizer interprets the emoji '🔗' as two characters, whereas spaCy interprets it as a single character. This gives the starting position of every entity after the emoji an off-by-one error, and with multiple emojis the effect is cumulative.

I've attached a minimal reproducible example which clearly shows this issue.

Text file: text.txt
Annotations: annotations.json

From NER Annotator:

[Screenshot 2024-08-09 005326: the entities as annotated in NER Annotator]

From spaCy:

[Screenshot: the same text rendered by spaCy, with shifted entities]

For this piece of text, the exported annotation is:

{"classes":["TEST"],"annotations":[["This part is fine - 🔗 - but this part is not.",{"entities":[[5,9,"TEST"],[34,38,"TEST"]]}]]}

Here, the second entity starts from index 34, which means there should be 34 characters in front of it. But if we count the characters (counting '🔗' as a single character), we will see that there are actually 33 characters in front of it. The first entity does not have this issue, correctly starting from index 5.
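
This is easy to check in Python:

text = "This part is fine - 🔗 - but this part is not."

print(text.index("part", 10))  # 33: Python counts '🔗' as one character
print(text[34:38])             # 'art ': the exported offset is one character too far right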

We can also see that '🔗' is being interpreted as two characters if we switch the annotation precision to 'Character Level'.

[Screenshot: the character-level view, where '🔗' occupies two positions]

I'll need some time to properly understand why this discrepancy exists and how to resolve it. @tecoholic, since this is related to the tokenization process, I would appreciate any hints you might be able to provide.

tecoholic commented 1 month ago

@alvi-khan Thanks for the thorough investigation of the issue. Can you kindly check whether the NLTK tokenizer in Python produces the same effect? That is, does it also count the Unicode character as 2 characters? Since the JS tokenizer we use is a port of the NLTK tokenizer, I suspect that would be the case.

If it turns out the NLTK tokenizer has the same issue, then we will need to update our tokenizer to follow the spaCy tokenizer, as this is, after all, an NER Annotator for spaCy.

Sidenote: This might be a good update to the software; we might end up handling non-English annotations properly as well.

elifbeyzatok00 commented 1 month ago

@alvi-khan @tecoholic Thank you very much for all your help. I will remove the emojis from the txt files before loading them into the NER Annotator. That way, I will prevent the character shifts caused by emojis.

I'm impressed that you responded so quickly and investigated the issue thoroughly. You have a great team. Thank you very much again.🤩

alvi-khan commented 1 month ago

I had a feeling this was an encoding issue.

In the JS port:

const TreebankTokenizer = require('treebank-tokenizer');

const tokenizer = new TreebankTokenizer();
console.log(tokenizer.span_tokenize("🔗"));

Output: [ [ 0, 2 ] ]

In Python:

from nltk.tokenize import TreebankWordTokenizer

list(TreebankWordTokenizer().span_tokenize("🔗"))

Output: [(0, 1)]

@tecoholic, you probably already know this, but for reference, the span_tokenize method in both the JS port and the original Python version calls an align_tokens method, which can be found in the utils package of both. This in turn uses the length of each token to determine the start and end of each span, and this is where our problems start.

In JS, "🔗".length == 2, but in Python, len("🔗") == 1.
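
For context, the Python version of align_tokens is roughly the following (paraphrased from nltk.tokenize.util; the JS port mirrors the same logic):

def align_tokens(tokens, sentence):
    point = 0
    offsets = []
    for token in tokens:
        # Find the token in the sentence, then advance past it.
        # Python's index() and len() count code points, whereas the JS
        # equivalents (indexOf() and .length) count UTF-16 code units.
        start = sentence.index(token, point)
        point = start + len(token)
        offsets.append((start, point))
    return offsets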

A detailed discussion on why this occurs is available here.

We can of course modify the align_tokens method in the JS port to force it to always give the same results as the Python variant by counting Unicode code points instead (which is how Python measures string length). In JS, the Array.from method does this for us, so Array.from("🔗").length == 1. A better example may be the '🤦🏼‍♂️' emoji, for which Array.from("🤦🏼‍♂️").length == 5. This is the same in Python, where len("🤦🏼‍♂️") == 5.

We need to decide whether we want to make this change or leave it as a known issue that we won't (or rather can't) fix. Since the issue, fundamentally, occurs due to a difference in how the two languages handle strings, changing the behavior may break things in unexpected and confusing ways.

There might also be an alternative approach that allows us to handle the issue in the Python script, but I can't think of one at the moment.
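
For reference, one possible direction: since JavaScript indexes strings by UTF-16 code units, the exported offsets could be remapped to code-point offsets on the Python side before rendering. A sketch (the helper name is hypothetical, and it assumes the exported offsets do count UTF-16 code units):

def utf16_offset_to_codepoint(text, offset):
    # Walk the string, counting 2 code units for characters outside the
    # Basic Multilingual Plane (like '🔗') and 1 for everything else.
    units = 0
    for i, ch in enumerate(text):
        if units >= offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

entities = [(utf16_offset_to_codepoint(text, start),
             utf16_offset_to_codepoint(text, end),
             label)
            for start, end, label in entities]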