w3c / iip

Documenting gaps and requirements for support of Indic languages on the Web and in eBooks.
https://w3c.github.io/iip/

Issues with ZWJ/ZWNJ #14

Open r12a opened 6 years ago

r12a commented 6 years ago

Vivek noted that:

2.1 In the recent discussion of proposals on IS 16350, the use of ZWJ or ZWNJ for alternate shapes was being discouraged. Not sure if this factors in here. The standards document recommended use of ZWJ and ZWNJ to represent alternate shapes.

We discussed this on the IIP telecon, and agreed to continue the discussion of ZWJ/ZWNJ here. A summary of the conclusion will probably belong in section 2.1 or 2.4 of the gap analysis.

I believe the question to be answered is: if the use of these characters is being discouraged, does it mean that users have problems creating the content that they need? Please provide examples, and attempt to gauge the impact, for Devanagari, Bengali and Tamil scripts.

miloush commented 6 years ago

My question is what does 'discouraged' mean in this context. Do you expect the rendering engines to behave differently on web sites than anywhere else?

r12a commented 6 years ago

Here's an example of a specific case where it appears that ZWJ is needed in Bengali/Assamese.

When U+09AF BENGALI LETTER YA (antaḥstha ya) occurs as the last member of a consonant cluster it has a special shape called ya-phalā.

[Screenshot of the ya-phalā shape]

To produce this shape in Unicode, just type the underlying sequence of characters as you would for any other consonant cluster. For example, ত্য at the end of the word সাহিত্য "literature" is written <ta, hasant, ya>. The font should produce the correct shape.

If ya follows ra in a consonant cluster, the font will normally produce the reph over the full form of ya, as in পর্যন্ত "until". On the rare occasions when you want to retain the ya-phalā shape when ya follows ra, e.g. র‍্যাংকিং, add U+200D ZERO WIDTH JOINER before the hasant.

(Actually, sometimes ZWNJ is used, but the Unicode Standard FAQ recently encouraged the use of ZWJ instead, since ZWNJ is used for a different shaping control.)
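To make the sequences described above concrete, here is a minimal Python sketch that builds them from code points and lists the characters they contain. The variable names and the `describe` helper are illustrative only and are not taken from any standard; how each string actually renders still depends on the font and shaping engine.

```python
import unicodedata

# Code points named in the discussion above.
TA = "\u09A4"      # BENGALI LETTER TA
YA = "\u09AF"      # BENGALI LETTER YA
RA = "\u09B0"      # BENGALI LETTER RA
HASANT = "\u09CD"  # BENGALI SIGN VIRAMA (hasant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def describe(text):
    """Return one 'U+XXXX NAME' entry per character (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# <ta, hasant, ya>: the ordinary cluster, shaped with ya-phala by the font.
tya = TA + HASANT + YA

# <ra, hasant, ya>: normally shaped as reph over the full form of ya.
rya_reph = RA + HASANT + YA

# <ra, ZWJ, hasant, ya>: ZWJ before the hasant requests ra + ya-phala.
rya_phala = RA + ZWJ + HASANT + YA

for label, seq in [("tya", tya), ("rya (reph)", rya_reph), ("rya (ya-phala)", rya_phala)]:
    print(label, seq, describe(seq))
```

The only difference between the last two strings is the single invisible U+200D, which is why the question of how users are expected to type it matters.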

There are a number of words in Bengali where it seems that ZWJ would be needed for this particular reason. How are people expected to type those words?

miloush commented 6 years ago

Thanks for the example. My point was a bit more high-level, i.e. what are you trying to achieve with the question. Is the idea to come up with an argument against 16350 discouraging the use of ZWJ/ZWNJ?

r12a commented 6 years ago

@miloush my initial post was just to get the discussion going. I'm relying on the Indian experts in the group to put some flesh around the situation wrt use or not of these characters. I'm hoping we'll get clarity on the proposal of 16350, and discuss the pros and cons of that, but also more generally understand the value of ZWJ/ZWNJ for these scripts and document any barriers to use of the Web arising from any of the above.

vivekpani commented 6 years ago

There has been no progress on IS 16350 yet. I was part of a committee discussion where it was brought out that the use of ZWJ and ZWNJ for alternate shapes must be discouraged. This doesn't affect browser rendering specifically, but the rendering of Indic scripts in general, where alternate conjunct shapes are common but are produced using ZWJ/ZWNJ.

The Bengali example Richard has given is similar to one in Odia (my mother tongue). The ya-phala is called "jya-phala" in both languages (and in Assamese as well). When joined with ta (0x0B24) as in satya (ସତ୍ୟ) or ka (0x0B15) as in bakya (ବାକ୍ୟ), the pronunciation is only ya. But when joined with ra (0x0B30) as in karya (କାର୍ୟ), the pronunciation is jya; in both cases it is called jya-phala. This phala has also been used to match English vowel pronunciations in words like cat, bank etc., which are written as କ୍ୟାଟ୍ and ବ୍ୟାଟ respectively. The ya-phala with aa-matra is used for the pronunciation of "a" in cat or bank.

However, the conjunct with ra (0x0B30) is typical, as ra forms reph when it is the first consonant in a conjunct. So, in a word like ranking, the ra will not take the ya-phala (similar to the example of karya above). But the pronunciation of karya is karjya. A conjunct starting with "ra" doesn't begin any word in the Indic vocabulary. So writing an English word in this way becomes an exception, and the "jya" pronunciation doesn't fit words like ranking. So, instead of seeing the conjunct rya, which would be pronounced rjya, right at the start of a word like ranking, it has been preferred to write full ra with a ya-phala (jya-phala). This new conjunct will at least not confuse a reader, who will be familiar with the phala that joins ta (0x0B24), ka (0x0B15) etc.

Hence, this is "not" a conjunct or consonant issue, and adding it as a rule will neither keep this as an exception nor truly resolve this need very well. In Devanagari, this was resolved by introducing two new vowels that were used only in Marathi-speaking regions to write those English pronunciations.

We may wait for IS 16350 updates, but given that this is non-native to the languages and is an exception used only to express foreign words starting with ra (0x0B30) and taking the pronunciation of a as in cat, I find it a bit too harsh to deviate from the intrinsic script grammar of these languages. The script grammar doesn't allow consonants to join vowels. So, a phala doesn't really join a vowel. And ra becomes a reph whenever it is the first consonant in any conjunct. A more acceptable form of writing is to just use the aa-matra (ରାଂକିଂ).

r12a commented 6 years ago

Along theoretical lines, i have no opinion for or against. But it seems that currently people do write borrowed words starting with 'ræ' rather than 'ra', using র‍্যা rather than রা and the former appears to be the de facto way of doing it. Whether they should or shouldn't, i think, is not the question here - here we are more concerned about whether the Web allows people to do what they need.

It seems to me that the question is rather, given that people want to produce র‍্যা and the only way to do so at the moment in Bengali is using a ZWJ, are they struggling to do so or not?

vivekpani commented 6 years ago

While this is not a significant deviation, because it is a display variant, Chapter 12 gives examples of ya-phala joining vowels for Bengali. I am not sure if there are examples of use. I agree with your reasoning and I think it is a fonts and rendering issue. People may struggle to learn the use of ZWJ and to discover it on keyboards (the INSCRIPT standard doesn't make it apparent, especially because ZWJ and ZWNJ have no visual representation and aren't visually represented on the keyboard).
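Because ZWJ and ZWNJ have no visible glyph, one way to check whether a piece of text contains them is to make them explicit. The following is only a rough Python sketch; `reveal_joiners` and `joiner_positions` are hypothetical helper names chosen for illustration, not part of any library or standard.

```python
# Map the two invisible joiner controls to visible tags.
JOINERS = {
    "\u200C": "ZWNJ",  # ZERO WIDTH NON-JOINER
    "\u200D": "ZWJ",   # ZERO WIDTH JOINER
}


def reveal_joiners(text):
    """Return text with each ZWJ/ZWNJ replaced by a visible <ZWJ>/<ZWNJ> tag."""
    return "".join(f"<{JOINERS[ch]}>" if ch in JOINERS else ch for ch in text)


def joiner_positions(text):
    """Return (index, name) pairs for every ZWJ/ZWNJ found in text."""
    return [(i, JOINERS[ch]) for i, ch in enumerate(text) if ch in JOINERS]


# Bengali <ra, ZWJ, hasant, ya>, written with escapes so the joiner is visible in source.
sample = "\u09B0\u200D\u09CD\u09AF"
print(reveal_joiners(sample))
print(joiner_positions(sample))
```

A check like this could help content creators confirm that the joiner they typed actually reached the stored text, whatever keyboard layout produced it.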

akshatsj commented 6 years ago

Sorry to jump into this discussion a bit late. I will try to bring in some more clarity regarding the various standards that have become part of this discussion.

IS 16350 is a keyboard layout standard for all the scheduled Indian languages and specifies the INSCRIPT layout for all of them. This standard specifies ZWNJ and ZWJ as a part of every language keyboard. It cursorily discusses usage patterns of ZWNJ and ZWJ. Being a keyboard layout standard, it does not go into exhaustive cases and scenarios in which the said characters can/should be used. However, IS 16350 does not discourage the use of either of them. There is another standard, IS 16333, which is oriented towards specifying the charset that mobile handsets must support in terms of display and input. In the drafting of that standard there were discussions about the usage of ZWJ and ZWNJ.

Whether some standards recommend or discourage the use of certain characters is immaterial unless compliance with those standards is made mandatory. Until recently (before IS 16333, which introduces compliance in terms of common minimum charset support for mobile handsets to be sold in India), the norms established by the major players in the industry became user standards. This indeed highlights a strong need for a central location which can always put forth the guidelines about use of these special characters. The Unicode Consortium, in Unicode Chapter 12 (http://www.unicode.org/versions/Unicode11.0.0/ch12.pdf), does specify certain use cases. Here at W3C, like Richard said, we should be concerned about whether a user is able to do what he intends to do. If not, what are the potential gap areas (e.g. the need for a central location to which anyone, including regular users, content creators, font developers and rendering engine developers, can refer for standardized usage) and how to bridge them.