giellalt / lang-crk

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Plains Cree language
https://giellalt.uit.no

Hyphens and other non-alphabetic characters as legit parts of words treated incorrectly as word boundaries (Bugzilla Bug 2641) #41

Open albbas opened 4 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2641

Date: 2019-11-16T02:44:29+01:00 From: @arppe@ualberta.ca To: Børre Gaup <>

Last updated: 2019-11-16T02:44:29+01:00

albbas commented 4 years ago

Comment 13816

Date: 2019-11-16 02:44:29 +0100 From: @arppe@ualberta.ca

The last time I checked MS Word on Windows (on Sjur's Mac running Windows), the demo crk spell checker (or Word itself) treated hyphens as word boundaries, when in fact hyphens are an integral part of well-written crk words in SRO (Standard Roman Orthography).

Examples of words that should be recognized:

a. Non-hyphenated: êkota êwako ispîhk kistapinânihk mistahi mâna namôya nitiskonikanihk ohci ohpimê

b. Hyphenated: kâ-kî-awâsisîwiyân nikî-nitawi-kiskinwahamâkosin kâ-kî-nitawi-kiskinwahamâkosiyân ê-kî-itohtahikawiyân nikî-kitimâkihikawinân niwî-âtotên niwî-âcimâwak

While this report concerns crk, similar issues arise in other languages, e.g. Mohawk, where the colon ':' should be allowed as an integral part of a word (it marks long phonemes).

Sjur tells me that this might have been resolved generally by the Divvun speller engine (?), which uses the character set of the speller FST as the basis for deciding what counts as a word. Nevertheless, I'm reporting this as an explicit issue so that the earlier incorrect behavior is on record and there are example cases for checking that it has been properly resolved (now and later on).
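To make the expected behavior concrete, here is a rough Python sketch of the idea of letting a character set define what a word is. It is not the Divvun engine's actual code or API. The helper name `make_tokenizer` is hypothetical, and the letter set is simply collected from the example words above rather than read from the speller FST, which is what a real implementation would presumably do.

```python
import re
import unicodedata

# Example words from this report; the hyphenated ones must stay whole.
NON_HYPHENATED = ("êkota êwako ispîhk kistapinânihk mistahi mâna namôya "
                  "nitiskonikanihk ohci ohpimê").split()
HYPHENATED = ("kâ-kî-awâsisîwiyân nikî-nitawi-kiskinwahamâkosin "
              "kâ-kî-nitawi-kiskinwahamâkosiyân ê-kî-itohtahikawiyân "
              "nikî-kitimâkihikawinân niwî-âtotên niwî-âcimâwak").split()


def nfc(text):
    """Normalize so that e.g. 'â' is always a single code point."""
    return unicodedata.normalize("NFC", text)


def make_tokenizer(letters, word_internal=""):
    """Hypothetical helper: a word is a run of `letters`, optionally joined
    by single `word_internal` characters that have letters on both sides."""
    def char_class(chars):
        return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"

    run = char_class(letters) + "+"
    if word_internal:
        run += "(?:" + char_class(word_internal) + char_class(letters) + "+)*"
    pattern = re.compile(run)
    return lambda text: pattern.findall(nfc(text))


# Assumption: stand-in for the FST alphabet, derived from the examples only.
LETTERS = {c for w in NON_HYPHENATED + HYPHENATED for c in nfc(w) if c != "-"}

naive = make_tokenizer(LETTERS)          # hyphen acts as a word boundary
fixed = make_tokenizer(LETTERS, "-")     # hyphen is part of the word

print(naive("niwî-âtotên"))  # the reported behavior: word split in two
print(fixed("niwî-âtotên"))  # the desired behavior: one word
```

With this sketch, `naive` splits `niwî-âtotên` into `niwî` and `âtotên`, while `fixed` returns it as one token. Passing `":"` as `word_internal` would cover the Mohawk case in the same way. The two word lists can double as a regression check once the real speller is hooked up.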