w3c / iip

Documenting gaps and requirements for support of Indic languages on the Web and in eBooks.
https://w3c.github.io/iip/

Gurmukhi Notes #46

Open kulpreetchilana opened 5 years ago

kulpreetchilana commented 5 years ago

This is mainly to track any differences or nuances between Devanagari and Gurmukhi in their gap analysis. My expectation is that most technical work required to support Devanagari will generalize to Gurmukhi as well, so most of the discussion regarding Gurmukhi can inherit from Devanagari, except where noted below.

Of interest to us as we look for usage specimens: the Punjab Digital Library has a treasure trove of old and contemporary documents in both Gurmukhi and Devanagari. We can definitely look there when investigating.

2.1. Encoding Considerations

Gurmukhi differs from other Indic scripts in that it has independent vowels, namely ੳ U+0A73, ਅ U+0A05 and ੲ U+0A72. ੳ U+0A73 and ੲ U+0A72 have no inherent sound and must have a dependent vowel attached (e.g. ੁ U+0A41).

For compatibility with other Indic scripts, ਉ U+0A09 exists as a single code point in Unicode. That is, ਉ U+0A09 occupies the same position in the Gurmukhi block as उ U+0909 does in the Devanagari block (Devanagari does not have independent vowels).

This causes great confusion for Punjabi / Gurmukhi users, who would expect ਉ U+0A09 and the combination of ੳ U+0A73 with ੁ U+0A41 to be equivalent, but they are not. We should get some sort of compatibility equivalence for these characters, or at least treat them as equivalent for the purposes of sorting, search, collation, etc.

    ਉ U+0A09 = ੳ U+0A73 + ੁ U+0A41
    ਊ U+0A0A = ੳ U+0A73 + ੂ U+0A42
    ਓ U+0A13 = ੳ U+0A73 + ੋ U+0A4B
    ਆ U+0A06 = ਅ U+0A05 + ਾ U+0A3E
    ਐ U+0A10 = ਅ U+0A05 + ੈ U+0A48
    ਔ U+0A14 = ਅ U+0A05 + ੌ U+0A4C
    ਇ U+0A07 = ੲ U+0A72 + ਿ U+0A3F
    ਈ U+0A08 = ੲ U+0A72 + ੀ U+0A40
    ਏ U+0A0F = ੲ U+0A72 + ੇ U+0A47

There seems to be a decent amount of data generated using the incorrect sequence. For example, searching ਅਾਲੂ returns 2,450 results on Google, while searching ਆਲੂ returns 131,000 results (~2%).
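
Until something like that equivalence exists, an application could fold the two-code-point sequences into the single code points itself before searching or sorting. The following is only a minimal sketch of that idea: the table is copied from the list above, while `GURMUKHI_VOWEL_FOLDS` and `fold_gurmukhi_vowels` are hypothetical names, not part of Unicode or any library. Unicode normalization (NFC) alone does not help here, because these letters have no canonical decomposition.

```python
import unicodedata

# Bearer + dependent vowel sequence -> single code point (from the list above).
GURMUKHI_VOWEL_FOLDS = {
    "\u0A73\u0A41": "\u0A09",  # ੳ + ੁ -> ਉ
    "\u0A73\u0A42": "\u0A0A",  # ੳ + ੂ -> ਊ
    "\u0A73\u0A4B": "\u0A13",  # ੳ + ੋ -> ਓ
    "\u0A05\u0A3E": "\u0A06",  # ਅ + ਾ -> ਆ
    "\u0A05\u0A48": "\u0A10",  # ਅ + ੈ -> ਐ
    "\u0A05\u0A4C": "\u0A14",  # ਅ + ੌ -> ਔ
    "\u0A72\u0A3F": "\u0A07",  # ੲ + ਿ -> ਇ
    "\u0A72\u0A40": "\u0A08",  # ੲ + ੀ -> ਈ
    "\u0A72\u0A47": "\u0A0F",  # ੲ + ੇ -> ਏ
}


def fold_gurmukhi_vowels(text: str) -> str:
    """Return text with the two-code-point vowel sequences folded (hypothetical helper)."""
    # NFC first, so other canonically equivalent input is in a consistent form.
    text = unicodedata.normalize("NFC", text)
    for sequence, single in GURMUKHI_VOWEL_FOLDS.items():
        text = text.replace(sequence, single)
    return text


# The two spellings of "aloo" from the search example compare equal after folding.
typed = "\u0A05\u0A3E\u0A32\u0A42"     # ਅ + ਾ + ਲ + ੂ
expected = "\u0A06\u0A32\u0A42"        # ਆ + ਲ + ੂ
print(fold_gurmukhi_vowels(typed) == expected)
```

Folding both the query and the indexed text this way would make the two spellings match in search without changing what is stored.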

2.3. Font styles

Should follow from the discussion in #39. I’ve commented there with some Gurmukhi “italic-equivalent” specimens.

2.6. Quotations

Should follow the discussion and resolution in #29. Quotation marks seem to be accepted in modern Gurmukhi texts. I’ve found their usage to mirror that of Latin in newspaper specimens ranging from as early as the 1900s to modern-day Punjabi newspapers.

2.8. Text boundaries & selection

This behavior varies even between different applications within macOS. Apple Pages seems to delete, select, and move the cursor by vowel + consonant cluster. TextEdit seems to delete by code point, but moves the cursor and selects by vowel + consonant cluster. In general, the TextEdit behavior seems preferable.

Gurmukhi has only three commonly used half/subjoined consonants: ਹ, ਵ and ਰ. Most Gurmukhi readers are not familiar with the concept of Halant / Virama, and thus deleting should remove the full half-character.
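
One way to picture the deletion behavior described above is a backspace that removes one code point at a time, as TextEdit appears to, but takes the virama together with a subjoined consonant. This is a minimal sketch under that assumption only; `backspace` is a hypothetical helper, not how any particular editor is implemented, and it ignores cases such as an intentionally visible virama.

```python
VIRAMA = "\u0A4D"  # GURMUKHI SIGN VIRAMA (halant)


def backspace(text: str) -> str:
    """Delete one code point, treating virama + consonant as a single unit."""
    if not text:
        return text
    text = text[:-1]
    # If the deletion exposed a trailing virama, the deleted letter was
    # subjoined; remove the virama too so no stray halant is left behind.
    if text.endswith(VIRAMA):
        text = text[:-1]
    return text


word = "\u0A15\u0A32\u0A4D\u0A39"            # ਕ + ਲ + ੍ + ਹ (ਕਲ੍ਹ)
print(backspace(word) == "\u0A15\u0A32")     # subjoined ਹ and its virama go together
```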

2.9. Transforming characters

Certain older Gurmukhi texts do not contain space characters. This form is known as Larivaar, and some websites allow you to toggle it as a setting. It might make sense for this to be a transform property on the text, so that we can preserve the word boundaries for line-breaking.
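
No such transform property exists today, so the following is only a string-level approximation of the idea, assuming that replacing each space with U+200B ZERO WIDTH SPACE is acceptable: the text renders without visible gaps, while the word boundaries remain available as line-break opportunities. `to_larivaar` and `from_larivaar` are hypothetical names. Unlike a style property, this changes the stored text, so a real solution would more likely live in the rendering layer.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE: invisible, but permits a line break


def to_larivaar(text: str) -> str:
    """Hide word spaces while keeping the boundaries as break opportunities."""
    return text.replace(" ", ZWSP)


def from_larivaar(text: str) -> str:
    """Restore ordinary spaces from the zero-width markers."""
    return text.replace(ZWSP, " ")


sample = "\u0A38\u0A24\u0A3F \u0A28\u0A3E\u0A2E\u0A41"  # ਸਤਿ ਨਾਮੁ
joined = to_larivaar(sample)
print(" " in joined)                    # False
print(from_larivaar(joined) == sample)  # True
```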

2.13. Emphasis & highlights

Bold and Italic have become commonplace for Gurmukhi publications. See examples in #39.

3.1. Line breaking

See note above about Larivaar

3.2. Hyphenation

Hyphenation is used in words such as ਅੱਜ-ਕਲ੍ਹ. I’m guessing we wouldn’t want to break mid-word, which seems to be the current behavior in many applications.

3.6. Baselines & inline alignment

Generally, Indian scripts that have a joining line use the joining line as the baseline. Ideally, if a document contains both Devanagari and Gurmukhi text (such as Mahan Kosh), the text should be aligned at the joining line regardless of the script or font.

4.1. Bidirectional layout

Given that Gurmukhi is used to write the Punjabi language, and the Perso-Arabic script is also used to write Punjabi in West Punjab, we need to ensure that bidi behavior functions as expected. I don’t see why it wouldn’t in current implementations, but just thought I'd note it here.
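
As a quick sanity check of why existing bidi implementations should already cope, the Unicode bidirectional classes of the two scripts can be inspected with Python's standard `unicodedata` module. This is just an illustrative sketch (`bidi_classes` is a hypothetical helper): Gurmukhi letters are strong left-to-right and Arabic-script letters are strong right-to-left, so the standard bidi algorithm has what it needs to order mixed text.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each code point."""
    return [unicodedata.bidirectional(ch) for ch in text]


print(bidi_classes("\u0A2A"))  # ਪ GURMUKHI LETTER PA -> ['L']
print(bidi_classes("\u067E"))  # پ ARABIC LETTER PEH  -> ['AL']
```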

4.3. Notes, footnotes, etc.

Western-style footnotes seem to be accepted as common. Harvard University Press’s Sufi Lyrics — Bulleh Shah makes use of them with Gurmukhi numerals.

vivekpani commented 5 years ago

@kulpreetchilana This is a perfect set of inputs with amazingly perfect understanding. I am very happy to have someone comment with such understanding. Unicode does not provide "any" useful guidelines about what must make it into the encoding. One purpose of an encoding is also to make the script and language usable with ease and without ambiguity.

The issues with the independent vowels you have highlighted are the same ones I have debated endlessly in LITD meetings at BIS, which releases the IS 16350 standard (borrowing from Unicode). ISCII (from which Unicode borrowed in its first version) did not encode the independent vowels for precisely this reason. They do not have any independent pronunciations and hence are not used in texts without the dependent vowels attached. To avoid confusion and to have uniformity in text for computing (search, sort, etc.), these were not encoded separately but in the joined forms with the dependent vowels.

The lack of such guidelines specific to phonetic scripts in Unicode has led to such display ambiguities in all scripts, and many fragments have been given encoded positions.

Font styles - Another unique feature of the Gurmukhi script is that there are no conjuncts. The only case is the doubling of a consonant, which is represented with the addak and not by any conjunct. This makes writing the script a lot simpler compared to Devanagari or other Indian scripts (except Tamil). Hence, stylised writing like italics doesn't make it difficult to read. In other scripts, where conjuncts stack (as in Devanagari, the phalas in Odia and Bangla, or the vattus in the Dravidian scripts), such styling makes some conjuncts unreadable or confusing. That may be the reason such styling is not seen in print in these scripts. The examples you presented could also be a font type (like cursive for Latin, which slants) rather than actual italicising of any specific font style. I am not an expert, but I know some font makers and can ask for their opinion.