w3c / string-search

Parking lot for advice on internationalization related string searching in general content
https://w3c.github.io/string-search/

Updates including Kashmiri examples from Richard #14

Closed aphillips closed 1 year ago

aphillips commented 2 years ago

Edits to address my action item. In this case, borrowing from @r12a's Kashmiri document (plus a few other minor edits).

@r12a Please review for veracity!

netlify[bot] commented 2 years ago

Deploy Preview for string-search-w3c ready!

Name Link
Latest commit 5b1fc297fb39231be37e171691d06b52754b55db
Latest deploy log https://app.netlify.com/sites/string-search-w3c/deploys/6388eca559623f000891b965
Deploy Preview https://deploy-preview-14--string-search-w3c.netlify.app/

aphillips commented 2 years ago

Have a look. Thinking we might merge this in and then get comments.

Do you have a handy example of Arabic with/without short vowels (to use in Example 7)?

aphillips commented 2 years ago

@r12a Could you have a look at this again? Note in particular the Gujarati example. If you have an example of a Gujarati word with varied handling thanks to anuswara/visarga/etc., that would be helpful (or other examples in other languages). I think this is close to being ready to commit?

r12a commented 2 years ago

In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] instead. The function of both of these diacritics is the same, but their code points are different.

I think this gives the wrong impression. The anusvara in 'hindi' represents an 'n' sound, rather than nasalisation. Candrabindu is a nasalisation mark, and not an alternative for a syllable-final 'n' (there is no nasalisation in 'hindi'). In principle, the anusvara is an alternative for candrabindu when space is constrained, but not the other way around. That said, it does sometimes appear to be the case that some words are spelled as if they were nasalised, or alternatively as if they had a syllable-final nasal, eg. snake at https://r12a.github.io/scripts/devanagari/hi.html#nasalisation I don't believe that applies to the word 'hindi' though.
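
To make the matching consequence concrete, here is a minimal sketch (assuming Python's standard `unicodedata` module; the two spellings of 'snake' are my reading of the nasalisation section linked above, and the helper name is made up for illustration). It checks whether any Unicode normalisation form would make the candrabindu and anusvara spellings compare equal.

```python
import unicodedata

# Two spellings of the Hindi word for 'snake' (assumed from the linked notes):
# one with U+0901 CANDRABINDU, one with U+0902 ANUSVARA.
SNAKE_CANDRABINDU = "\u0938\u093E\u0901\u092A"
SNAKE_ANUSVARA = "\u0938\u093E\u0902\u092A"

def equal_under(form: str, a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalising both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Collect the normalisation forms (if any) under which the spellings match.
matching_forms = [
    form
    for form in ("NFC", "NFD", "NFKC", "NFKD")
    if equal_under(form, SNAKE_CANDRABINDU, SNAKE_ANUSVARA)
]
print(matching_forms)
```

Neither mark has a decomposition mapping, so a search that should treat these spellings as the same word would need its own language-specific folding step on top of normalisation.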

r12a commented 2 years ago

The alternative use of either a letter or a diacritic for syllable-final nasals is common to many other Indian languages. In addition to Devanagari (used to write languages such as Hindi (language tag hi) or Marathi (language tag mr), scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.

[1] I'd say 'several' rather than 'many'

[2] your parens are not properly balanced

[3] The Gujarati example is basically the same as the Devanagari one. Is there not a different kind of spelling difference in the CDAC doc that we can use?

aphillips commented 2 years ago

I think this gives the wrong impression. The anusvara in 'hindi' represents an 'n' sound, rather than nasalisation. Candrabindu is a nasalisation mark, and not an alternative for a syllable-final 'n' (there is no nasalisation in 'hindi'). In principle, the anusvara is an alternative for candrabindu when space is constrained, but not the other way around. That said, it does sometimes appear to be the case that some words are spelled as if they were nasalised, or alternatively as if they had a syllable-final nasal, eg. snake at https://r12a.github.io/scripts/devanagari/hi.html#nasalisation I don't believe that applies to the word 'hindi' though.

Super useful. I'll adjust the example (borrowing from your document)

r12a commented 2 years ago

Users may also create visually identical (or very similar) graphemes from sequences of characters that are deprecated or unexpected by the Unicode Standard. For example, in some fonts it is possible to create something that looks like the independent vowel /au/ using the (normal) ஔ [U+0B94 TAMIL LETTER AU], or by typing two inappropriate individual letters, ஒள [U+0B92 TAMIL LETTER O + U+0BB3 TAMIL LETTER LLA]. The latter should by avoided by users, but applications will need to decide whether or not to match such aberrations if they appear in the text.

The alternatives you use as examples are neither deprecated nor unexpected by Unicode. They are canonically equivalent precomposed vs decomposed instances. So, even if Unicode expresses a preference for precomposed it's not really essential to choose one rather than the other, and normalisation will resolve the matching problem.
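
As a rough illustration of normalisation resolving a precomposed vs decomposed match, here is a sketch using Python's standard `unicodedata` module. It assumes the decomposed form is U+0B92 followed by U+0BD7 TAMIL AU LENGTH MARK, which is the canonical decomposition Python reports for U+0B94; the helper name is hypothetical. It also checks the O + LLA sequence from the quoted text, so the two cases can be compared side by side.

```python
import unicodedata

TAMIL_AU = "\u0B94"                # precomposed TAMIL LETTER AU
TAMIL_O_LENGTH_MARK = "\u0B92\u0BD7"  # TAMIL LETTER O + TAMIL AU LENGTH MARK
TAMIL_O_LLA = "\u0B92\u0BB3"       # TAMIL LETTER O + TAMIL LETTER LLA (from the quoted text)

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: True if both strings are identical after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Canonical decomposition of the precomposed letter, as code point labels.
decomposed = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", TAMIL_AU)]
print(decomposed)

print(canonically_equal(TAMIL_AU, TAMIL_O_LENGTH_MARK))
print(canonically_equal(TAMIL_AU, TAMIL_O_LLA))
```

With Python's data, only the sequence ending in the length mark folds to U+0B94; the sequence ending in LLA is left unchanged by normalisation, so it is worth double-checking which of the two the example text is meant to describe.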

A much better example would be something like the vowel-sign constructions shown at https://r12a.github.io/scripts/devanagari/hi.html#vowelsign_encoding, https://r12a.github.io/scripts/bengali/bn.html#vowelsign_encoding2 or https://r12a.github.io/scripts/malayalam/ml.html#vowelsign_encoding2 (the first one at the last link is particularly good)

r12a commented 2 years ago

Confusables or spelling errors these can be common in certain kinds of text due to gaps in keyboard support or due to a similarity in appearance

That particular example (with YEH) is a problematic one. A much better example might be ⑤ at https://r12a.github.io/scripts/arabic/ks.html#confusables (though it may need a little more explanation). See the note for ⑤. You'll need the noto nastaliq font with the language set to ks to see the right glyph for the sukun (circumflex). Examples of that confusion/workaround are easy to find in the wild.

aphillips commented 2 years ago

[3] The Gujarati example is basically the same as the Devanagari one. Is there not a different kind of spelling difference in the CDAC doc that we can use?

I don't know. Most of the examples seem to be about nasals, but some are unlabeled.

aphillips commented 2 years ago

The alternatives you use as examples are neither deprecated nor unexpected by Unicode.

I think this text is verbatim from you? 😉 I'll change to the new example you suggest.

r12a commented 2 years ago

I think this text is verbatim from you? 😉 I'll change to the new example you suggest.

Oh, if it is then let me know where you got it from. Here's the appropriate section in my Tamil notes: https://r12a.github.io/scripts/tamil/ta.html#standalone_encoding I don't see that text anywhere else in the Tamil page.

aphillips commented 1 year ago

@r12a Another look? I'd like to start the year with this merged in... ;-)

aphillips commented 1 year ago

Merging per telecon of 2022-12-15