Closed aphillips closed 1 year ago
Name | Link |
---|---|
Latest commit | 5b1fc297fb39231be37e171691d06b52754b55db |
Latest deploy log | https://app.netlify.com/sites/string-search-w3c/deploys/6388eca559623f000891b965 |
Deploy Preview | https://deploy-preview-14--string-search-w3c.netlify.app/ |
Have a look. Thinking we might merge this in and then get comments.
Do you have a handy example of Arabic with/without short vowels (to use in Example 7)?
@r12a Could you have a look at this again? Note in particular the Gujarati example. If you have an example of a Gujarati word with varied handling thanks to anuswara/visarga/etc., that would be helpful (or other examples in other languages). I think this is close to being ready to commit?
In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] instead. The function of both of these diacritics is the same, but their code points are different.
I think this gives the wrong impression. The anusvara in 'hindi' represents an 'n' sound, rather than nasalisation. Candrabindu is a nasalisation mark, and not an alternative for a syllable-final 'n' (there is no nasalisation in 'hindi'). In principle, the anusvara is an alternative for candrabindu when space is constrained, but not the other way around. That said, it does sometimes appear to be the case that some words are spelled as if they were nasalised, or alternatively as if they had a syllable-final nasal, e.g. 'snake' at https://r12a.github.io/scripts/devanagari/hi.html#nasalisation. I don't believe that applies to the word 'hindi' though.
The alternative use of either a letter or a diacritic for syllable-final nasals is common to many other Indian languages. In addition to Devanagari (used to write languages such as Hindi (language tag hi) or Marathi (language tag mr), scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.
[1] I'd say 'several' rather than 'many'
[2] your parens are not properly balanced
[3] The Gujarati example is basically the same as the Devanagari one. Is there not a different kind of spelling difference in the CDAC doc that we can use?
I think this gives the wrong impression. The anusvara in 'hindi' represents an 'n' sound, rather than nasalisation. Candrabindu is a nasalisation mark, and not an alternative for a syllable-final 'n' (there is no nasalisation in 'hindi'). In principle, the anusvara is an alternative for candrabindu when space is constrained, but not the other way around. That said, it does sometimes appear to be the case that some words are spelled as if they were nasalised, or alternatively as if they had a syllable-final nasal, e.g. 'snake' at https://r12a.github.io/scripts/devanagari/hi.html#nasalisation. I don't believe that applies to the word 'hindi' though.
Super useful. I'll adjust the example (borrowing from your document)
Users may also create visually identical (or very similar) graphemes from sequences of characters that are deprecated or unexpected by the Unicode Standard. For example, in some fonts it is possible to create something that looks like the independent vowel /au/ using the (normal) ஔ [U+0B94 TAMIL LETTER AU], or by typing two inappropriate individual letters, ஒள [U+0B92 TAMIL LETTER O + U+0BB3 TAMIL LETTER LLA]. The latter should be avoided by users, but applications will need to decide whether or not to match such aberrations if they appear in the text.
The alternatives you use as examples are neither deprecated nor unexpected by Unicode. They are canonically equivalent precomposed vs decomposed instances. So, even if Unicode expresses a preference for precomposed it's not really essential to choose one rather than the other, and normalisation will resolve the matching problem.
A much better example would be something like the vowel-sign constructions shown at https://r12a.github.io/scripts/devanagari/hi.html#vowelsign_encoding, https://r12a.github.io/scripts/bengali/bn.html#vowelsign_encoding2 or https://r12a.github.io/scripts/malayalam/ml.html#vowelsign_encoding2 (the first one at the last link is particularly good)
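For anyone following along, a minimal sketch of how to check whether two sequences are canonically equivalent, so that normalisation alone would make them match. It uses Python's standard `unicodedata` module; the helper name `canonically_equivalent` is made up for illustration and is not part of the draft or of any library.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b are identical after NFD normalisation.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Precomposed vs. decomposed Latin e-acute.
print(canonically_equivalent("\u00E9", "e\u0301"))

# TAMIL LETTER AU vs. TAMIL LETTER O + TAMIL AU LENGTH MARK.
print(canonically_equivalent("\u0B94", "\u0B92\u0BD7"))

# TAMIL LETTER AU vs. TAMIL LETTER O + TAMIL LETTER LLA (the pair quoted above).
print(canonically_equivalent("\u0B94", "\u0B92\u0BB3"))
```

If a check prints `False`, normalisation on its own will not make that pair match, and the application has to decide separately whether to treat the two sequences as equivalent.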
Confusables or spelling errors: these can be common in certain kinds of text due to gaps in keyboard support or due to a similarity in appearance.
That particular example (with YEH) is a problematic one. A much better example might be โค at https://r12a.github.io/scripts/arabic/ks.html#confusables (though it may need a little more explanation). See the note for โค. You'll need the noto nastaliq font with the language set to ks to see the right glyph for the sukun (circumflex). Examples of that confusion/workaround are easy to find in the wild.
[3] The Gujarati example is basically the same as the Devanagari one. Is there not a different kind of spelling difference in the CDAC doc that we can use?
I don't know. Most of the examples seem to be about nasals, but some are unlabeled.
The alternatives you use as examples are neither deprecated nor unexpected by Unicode.
I think this text is verbatim from you? I'll change to the new example you suggest.
Oh, if it is then let me know where you got it from. Here's the appropriate section in my Tamil notes: https://r12a.github.io/scripts/tamil/ta.html#standalone_encoding I don't see that text anywhere else in the Tamil page.
@r12a Another look? I'd like to start the year with this merged in... ;-)
Merging per telecon of 2022-12-15
Edits to address my action item. In this case, borrowing from @r12a's Kashmiri document (plus a few other minor edits).
@r12a Please review for veracity!