Tokenizer treats an alphabetic character as a word-delimiter

projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection

https://endings.uvic.ca/staticSearch/docs/index.html

Mozilla Public License 2.0


Tokenizer treats an alphabetic character as a word-delimiter #300

Open martindholmes opened 5 months ago

martindholmes commented 5 months ago

The codepoint U+A78F:

https://util.unicode.org/UnicodeJsps/character.jsp?a=A78F

(LATIN LETTER SINOLOGICAL DOT) is in the Latin Extended-D block, is Alphabetic, and is in the Other_Letter category. Wikipedia explains: "A middot may be used as a consonant or modifier letter, rather than as punctuation, in transcription systems and in language orthographies. For such uses Unicode provides the code point U+A78F ꞏ LATIN LETTER SINOLOGICAL DOT."

It's being proposed for use in this way (as a consonant to signal length) in Wendat orthography. However, our tokenizer currently treats it as a word-break character; I think this is a bug. It could be a bug in the regex in the tokenizer, or in the Java Unicode regex handling; the character is new enough in Unicode (2015) that the problem could just be that the code hasn't caught up. If so, I think we should special-case it.

martindholmes commented 5 months ago

This seems to be a bug in Saxon or Java, because both of these test false:

matches('ꞏ', '\p{L}')
matches('ꞏ', '\p{L}')
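One way to narrow down whether the problem lies in Java or in Saxon might be to ask the JDK directly, outside Saxon. The following is only a rough sketch: the class and method names are made up for illustration, and the answers it prints depend on which Unicode version the running JDK supports, so no particular output is assumed here.

```java
import java.util.regex.Pattern;

public class SinologicalDotCheck {
    // U+A78F LATIN LETTER SINOLOGICAL DOT, the character from this ticket.
    static final int SINOLOGICAL_DOT = 0xA78F;

    // Asks the JDK's own character database whether the code point is a letter.
    static boolean jdkSaysLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Asks java.util.regex whether the code point matches \p{L}.
    static boolean jdkRegexSaysLetter(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{L}", s);
    }

    public static void main(String[] args) {
        // The result varies with the JDK's Unicode tables, so print the version too.
        System.out.println("java.version       = " + System.getProperty("java.version"));
        System.out.println("Character.isLetter = " + jdkSaysLetter(SINOLOGICAL_DOT));
        System.out.println("regex \\p{L}        = " + jdkRegexSaysLetter(SINOLOGICAL_DOT));
    }
}
```

If the JDK reports the character as a letter while the XPath matches() call still returns false, that would point at the regex tables used by the XPath processor rather than at Java itself.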

I think the best thing to do for now is to add this character explicitly to the regex for alphanumerics.
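As an illustration of what that special-casing amounts to, here is a minimal Java sketch. It is not the project's actual tokenizer or regex: the pattern, class name and sample strings are hypothetical, and the sample strings are not real Wendat words. The point is only that listing U+A78F explicitly in the character class keeps it inside a token regardless of how the engine classifies it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {
    // Hypothetical word pattern: letters and digits, plus U+A78F listed
    // explicitly so it counts as a word character even if \p{L} misses it.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}\\uA78F]+");

    // Returns every maximal run of word characters in the text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up strings: the first contains U+A78F, the second an
        // ordinary middle dot (U+00B7), which is punctuation.
        System.out.println(tokenize("xa\uA78Fy zz"));
        System.out.println(tokenize("xa\u00B7y zz"));
    }
}
```

With this pattern the first sample yields two tokens, with the sinological dot kept inside the first one, while the second sample is split into three tokens at the ordinary middle dot.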

martindholmes commented 5 months ago

Fix and test for it committed in branch iss-300-sindot. PR #301 created.

martindholmes commented 5 months ago

Martin Honnen pointed me to the Saxon documentation, which says that it's still using Unicode 6 tables:

https://www.saxonica.com/html/documentation12/conformance/xpath31.html

So that would explain it, if the documentation is up to date.

martindholmes commented 4 months ago

I think this issue is complete, but only through the ad-hoc hack of adding the specific character concerned into the regex. Somehow or other, we should keep this around to remind ourselves that when Saxon 12.5 comes out, we need to move to it, and remove the hack.

martindholmes commented 4 weeks ago

Note: Saxon 12.5 was released in July, so I'll add a ticket for upgrading to it, and link it to this ticket. If the upgrade goes smoothly we should be able to test the removal of this hack.

martindholmes commented 1 week ago

Saxon 12.5 now merged, so this can be tested and the hack removed if no longer required.