Correction of incorrect offsets to apply corrections when there are characters encoded on 4 bytes in the text to be corrected

mdevolde commented 3 months ago

While using the library in a project, I noticed strange behaviour when checking a text for errors and applying the suggestions given by LanguageTool: the corrections were misplaced when the text included characters encoded on 4 bytes.

Take this code for example:

import language_tool_python

def patch_text(text):
    with language_tool_python.LanguageTool('en-US') as tool:
        errors = tool.check(text)
    patched_text = language_tool_python.utils.correct(text, errors)
    return patched_text

if __name__ == '__main__':
    text = """
The sun was seting 🌅, casting a warm glow over the park. Birds chirpped softly 🐦 as the day slowly fade into night.
    """
    print(patch_text(text))

At present, in v2.8, the result is as follows: The sun was setting 🌅, casting a warm glow over the park. Birds cchippedsoftly 🐦 as the day slowly fade into night.

Why does it produce this result? The two emojis in the sentence are encoded on 4 bytes, and it seems that LanguageTool, when calculating offsets, counts each such character as 2 characters rather than 1. Every offset after the first emoji is therefore shifted by 1, so the second correction (chirpped -> chipped) was applied one character too far to the right. The first correction (seting -> setting) was applied correctly because it comes before any character encoded on 4 bytes, so its offset was not shifted.
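The shift described above can be reproduced without LanguageTool. The sketch below assumes the offsets are counted in UTF-16 code units (as Java strings are), which is a guess about the cause and not something confirmed in this thread; `utf16_len` is a hypothetical helper written only for this illustration.

```python
# Illustration only: compare Python's code-point index with a UTF-16
# code-unit count for the same position in the text.
text = "The sun was seting 🌅, casting a warm glow over the park. Birds chirpped softly"

def utf16_len(s):
    """Number of UTF-16 code units in s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2

# Before the emoji, both ways of counting agree.
first_py = text.index("seting")
first_utf16 = utf16_len(text[:first_py])

# After the emoji, the UTF-16 count is one larger, because the emoji
# occupies two code units but is a single Python character.
second_py = text.index("chirpped")
second_utf16 = utf16_len(text[:second_py])

print(first_utf16 - first_py)    # difference before the emoji
print(second_utf16 - second_py)  # difference after the emoji
```

Under that assumption, each additional 4-byte character before an error adds one more unit to the difference.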

With my update, here's the result: The sun was setting 🌅, casting a warm glow over the park. Birds chipped softly 🐦 as the day slowly fade into night.

I added a function that finds the positions of all characters encoded on 4 bytes, and the correction function now uses its result to adjust the offsets.
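A minimal sketch of that approach follows. It is not necessarily the code in this pull request: the names `four_byte_positions` and `to_python_offset` are hypothetical, and it assumes each 4-byte character adds exactly one extra unit to the reported offsets.

```python
def four_byte_positions(text):
    """Return the Python indices of characters that need 4 bytes in UTF-8."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

def to_python_offset(reported_offset, positions):
    """Convert an offset that counts 4-byte characters twice into a str index.

    `positions` must be sorted in ascending order, as returned by
    four_byte_positions.
    """
    shift = 0
    for pos in positions:
        # pos + shift is where this character starts in the doubled count.
        if pos + shift < reported_offset:
            shift += 1
        else:
            break
    return reported_offset - shift

sample = "a🌅b🐦c"
positions = four_byte_positions(sample)
# In the doubled count, 'b' sits at 3 and 'c' at 6.
b_index = to_python_offset(3, positions)
c_index = to_python_offset(6, positions)
print(sample[b_index], sample[c_index])
```

Converting each offset before slicing keeps the replacement logic itself unchanged, so only the offsets need to be adjusted.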

jxmorris12 commented 3 months ago

Thank you!

mdevolde commented 3 months ago

@jxmorris12 In fact, the corrections applied here should also resolve this issue:

Offset position "longer" than text #83

jxmorris12 / language_tool_python

Correction of incorrect offsets to apply corrections when there are characters encoded on 4 bytes in the text to be corrected #94