While using the library for a project, I noticed strange behaviour when checking a text for errors and applying the suggestions given by LanguageTool, if the text included characters encoded in 4 bytes.
Take this code for example:
import language_tool_python


def patch_text(text):
    with language_tool_python.LanguageTool('en-US') as tool:
        errors = tool.check(text)
        patched_text = language_tool_python.utils.correct(text, errors)
    return patched_text


if __name__ == '__main__':
    text = """
The sun was seting 🌅 , casting a warm glow over the park. Birds chirpped softly 🐦 as the day slowly fade into night.
"""
    print(patch_text(text))
At present, in v2.8, the result is as follows:
The sun was setting 🌅 , casting a warm glow over the park. Birds cchippedsoftly 🐦 as the day slowly fade into night.
Why does it produce this result? The two emojis in the sentence are each encoded in 4 bytes in UTF-8, and it seems that LanguageTool, when calculating the offsets, counts such a character as 2 characters and not 1. That would be consistent with offsets counted in UTF-16 code units, where these characters take a surrogate pair, whereas Python indexes strings by code point.
As a result, the offsets after the first emoji are shifted by 1, so the second correction (chirpped -> chipped) is applied one character too far to the right.
The first correction (seting -> setting) is applied correctly because it comes before any character encoded in 4 bytes, so its offset is not shifted.
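To make the mismatch concrete, here is a small sketch using only the standard library. It assumes the offsets are counted in UTF-16 code units (my reading of the behaviour, not something I verified in LanguageTool's source), and `utf16_length` is a helper name made up for this illustration.

```python
def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed to encode s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


emoji = "\U0001F426"  # bird emoji, a code point above U+FFFF

print(len(emoji))                  # 1: Python counts code points
print(len(emoji.encode("utf-8")))  # 4: bytes in UTF-8
print(utf16_length(emoji))         # 2: a surrogate pair in UTF-16

text = "Birds \U0001F426 sing"
py_index = text.index("sing")                 # index as Python sees it
utf16_index = utf16_length(text[:py_index])   # index counted in UTF-16 code units

print(py_index, utf16_index)       # 8 9: one emoji before the word, shift of 1
```

Every such character before a match adds 1 to the difference between the two ways of counting.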
With my update, here's the result:
The sun was setting 🌅 , casting a warm glow over the park. Birds chipped softly 🐦 as the day slowly fade into night.
I added a function that finds the positions of all the characters encoded in 4 bytes, and the correction function now uses its result to adjust the offsets.
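The idea can be sketched roughly as follows. This is a simplified illustration rather than the exact code in the diff: the names `find_4_byte_positions` and `to_python_offset` are hypothetical, and the sketch assumes each reported offset is larger than the true Python index by the number of 4-byte characters that precede it.

```python
from typing import List


def find_4_byte_positions(text: str) -> List[int]:
    """Python indices of characters above U+FFFF (4 bytes in UTF-8)."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def to_python_offset(reported_offset: int, positions: List[int]) -> int:
    """Map an offset that counts each 4-byte character twice to a Python index.

    `positions` must be sorted in ascending order, as returned by
    find_4_byte_positions.
    """
    shift = 0
    for pos in positions:
        # pos + shift is where this character starts in the reported numbering.
        if pos + shift < reported_offset:
            shift += 1
        else:
            break
    return reported_offset - shift


# Assumed example: the checker reports "chirpped" at offset 5, length 8.
text = "a \U0001F305 chirpped b"
positions = find_4_byte_positions(text)
start = to_python_offset(5, positions)
patched = text[:start] + "chipped" + text[start + 8:]
print(patched)
```

Matches located before the first 4-byte character are left unchanged, which is why the first correction in the example was already right. If a match itself contained such a character, its length would need the same adjustment; the sketch does not cover that case.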