languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Letter Apostrophe U+02BC is incorrectly rejected by the spellchecker in Belarusian texts #8366

Open ssvb opened 1 year ago

ssvb commented 1 year ago

Paste the sentence "З'ява, зʼява, з’ява" into https://languagetool.org/ and spellcheck it as Ukrainian and then as Belarusian. The three words differ only in the apostrophe: U+0027, U+02BC and U+2019, respectively. Observe the following outcome:

[Screenshot: goodukr]

[Screenshot: badbel]

Here the U+02BC apostrophe is incorrectly rejected for Belarusian. The U+0027 and U+2019 apostrophes are fine.
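
To make the difference between the three characters visible, here is a minimal Python sketch that lists the apostrophe-like code points in the test sentence together with their Unicode names and general categories. The helper name `list_apostrophes` is made up for illustration; only the standard `unicodedata` module is used, and nothing here reflects how LanguageTool itself tokenizes text.

```python
import unicodedata

# The test sentence from this report, with the apostrophes written as escapes
# so that the three variants cannot be confused visually.
SENTENCE = "З\u0027ява, з\u02BCява, з\u2019ява"

# The three apostrophe-like characters discussed in this issue.
APOSTROPHES = {"\u0027", "\u02BC", "\u2019"}


def list_apostrophes(text):
    """Return (code point, Unicode name, general category) for each apostrophe found."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
        if ch in APOSTROPHES
    ]


for entry in list_apostrophes(SENTENCE):
    print(entry)
```

Only U+02BC is reported with a letter category (`Lm`); the other two are punctuation (`Po` and `Pf`).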

ssvb commented 1 year ago

Here's an example of a public-domain Belarusian book that uses U+02BC apostrophes: https://knihi.com/Jakub_Kolas/Malady_dubok.html The spellchecker is unable to validate the following sentence taken from it: "Так і за сталом жыцця, дзе сядзіць цесная людская сямʼя, не пустуе месца выхвачанага смерцю семʼяніна."

ssvb commented 1 year ago

Moreover, "Правілы беларускай арфаграфіі і пунктуацыі (2008)" ("The Rules of Belarusian Orthography and Punctuation (2008)") explains the usage of the apostrophe in the same chapter as the soft sign "ь" and does not mention it in the chapter dedicated to punctuation. The apostrophe is not related to punctuation; it behaves more like the Russian hard sign letter "ъ".

The Unicode spec differentiates between a "letter apostrophe" and a "punctuation apostrophe". There is an "Apostrophes" section in http://www.unicode.org/versions/Unicode15.0.0/ch06.pdf#G12411 with authoritative explanations. Basically, U+02BC is the right apostrophe for the Belarusian language. However, the U+2019 apostrophe may also be encountered in the real world because of mapping from other character sets and because of users’ failure to strictly follow the standard. The U+2019 apostrophe should therefore be considered ambiguous and context-dependent.
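
One way to handle the ambiguity is sketched below in Python: treat U+0027 and U+2019 as letter apostrophes only when they sit directly between two letters, and map them to U+02BC before dictionary lookup. This is an assumption about a possible pre-processing step, not a description of LanguageTool's code; the function name `normalize_apostrophes` is hypothetical.

```python
import re

LETTER_APOSTROPHE = "\u02BC"

# U+0027 or U+2019 with a letter immediately before and after it.
# [^\W\d_] matches any Unicode letter (a word character that is not a digit or "_").
_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])[\u0027\u2019](?=[^\W\d_])")


def normalize_apostrophes(text):
    """Replace word-internal U+0027/U+2019 with U+02BC; leave all others untouched."""
    return _BETWEEN_LETTERS.sub(LETTER_APOSTROPHE, text)


print(normalize_apostrophes("З\u0027ява, з\u02BCява, з\u2019ява"))
```

Apostrophes at a word boundary, such as quotation marks around a word, are not between two letters and are left as they are, so only the context that is unambiguous for Belarusian is rewritten.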