harfbuzz / harfbuzz

HarfBuzz text shaping engine
http://harfbuzz.github.io/

Indic: Visarga characters should combine correctly after Vedic tone markers #2017

Closed jamadagni closed 3 years ago

jamadagni commented 5 years ago

Reported at LibreOffice and Firefox but was redirected upstream here.

The following behaviour tested with current master: b0b8551a.

Please see the following rendering of the sequence दे॒वेभ्य॑ः (image: render). A dotted circle appears before the visarga, but there should not be one.

Reason:

Indic visarga characters are always spacing and placed to the right of the syllable. Vedic tone markers whether spacing or non-spacing will always be input before visarga because they apply to the vowel before the visarga.

So the expected sequence is:

syllable + zero or more tone markers + visarga

and so sequences like the example above where the visarga is placed after such a tone marker should not cause a dotted circle due to cluster breakup.
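A minimal sketch of how the two mark orders can be told apart in a string. The code point sets below are an assumption made for illustration (only the Devanagari visarga and two stress signs), not the full Unicode property data, and `mark_order` is a hypothetical helper name.

```python
import re

# Assumption: tiny illustrative sets, not the complete property lists.
TONE_MARKS = "\u0951\u0952"  # DEVANAGARI STRESS SIGN UDATTA, ANUDATTA
VISARGA = "\u0903"           # DEVANAGARI SIGN VISARGA

# Tone mark(s) followed by visarga: the order requested in this issue.
REQUESTED = re.compile(f"[{TONE_MARKS}]+{VISARGA}")
# Visarga followed by tone mark(s): the opposite order.
REVERSED = re.compile(f"{VISARGA}[{TONE_MARKS}]+")


def mark_order(text):
    """Report which tone-mark/visarga order, if any, occurs in text."""
    if REQUESTED.search(text):
        return "tone-then-visarga"
    if REVERSED.search(text):
        return "visarga-then-tone"
    return "no tone+visarga pair"


# The example word from this report, written with escapes.
sample = "\u0926\u0947\u0952\u0935\u0947\u092D\u094D\u092F\u0951\u0903"
print(mark_order(sample))  # tone-then-visarga
```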

The file http://www.unicode.org/Public/UNIDATA/IndicSyllabicCategory.txt does an excellent job of listing the visarga characters and the tone markers under the sections:

Indic_Syllabic_Category=Visarga
Indic_Syllabic_Category=Cantillation_Mark
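For anyone wanting to extract those two sections programmatically, here is a rough sketch of a parser for the `code ; value # comment` line format that the file uses. `parse_insc` is a hypothetical helper, and the `SAMPLE` lines are a hand-copied excerpt in the file's format rather than the downloaded file itself.

```python
def parse_insc(text, wanted):
    """Collect code points per Indic_Syllabic_Category value in `wanted`."""
    result = {name: [] for name in wanted}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue                          # blank or comment-only line
        code_range, value = (part.strip() for part in data.split(";", 1))
        if value not in wanted:
            continue
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result[value].extend(range(start, end + 1))
    return result


# Assumption: a few sample lines in the file's format, for illustration only.
SAMPLE = """\
# Indic_Syllabic_Category=Visarga
0903          ; Visarga # Mc       DEVANAGARI SIGN VISARGA
0951..0952    ; Cantillation_Mark # Mn   [2] DEVANAGARI STRESS SIGN UDATTA..DEVANAGARI STRESS SIGN ANUDATTA
0915..0939    ; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
"""

categories = parse_insc(SAMPLE, {"Visarga", "Cantillation_Mark"})
for name, points in sorted(categories.items()):
    print(name, [f"U+{cp:04X}" for cp in points])
```

Passing the full text of IndicSyllabicCategory.txt instead of `SAMPLE` should yield the complete lists for every script.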

Notes:

I am not aware of any one publicly available font that supports all relevant characters since Vedic is a rare use case. I attach the OFL-ed Lohit Devanagari locally modified by myself adding the extra required characters: Lohit-Devanagari-shriramana-20190219-1416.ttf.gz

I also upload a Python script to produce test case sequences: sample.py.txt (I had to add a .txt extension else GH doesn't accept the attachment.)

Note that the script currently produces test cases only for Devanagari, as I only have a Vedic-supporting font for that script, but you can see that it can easily be switched to print randomly chosen other scripts as well.

BTW I don't know whether this is connected to / same as #1142 (as it was closed with an enigmatic “meh”).

dscorbett commented 5 years ago

I agree that the Vedic tone mark should precede the visarga, but not for the reason you give (that the tone mark applies to the vowel before the visarga). Both are, in USE terms, vowel modifiers. Above-base vowel modifiers (like svarita) precede post-base vowel modifiers (like visarga). However, Devanagari uses the Indic shaper, not USE, which is why the marks are reversed. Unfortunately, for compatibility with Windows, I don’t think we can change this, since DirectWrite inserts a dotted circle in this string too. (Incidentally, this will also necessitate a hack so that the current order will still work in 'dev3'.)

jamadagni commented 5 years ago

Sorry but I don't understand this: how is bug-for-bug compatibility with Microsoft software a good thing? Feature-for-feature compatibility I can understand.

Valid sequences are to be certified by native users and displayed correctly by software. If the current behaviour is wrong it is wrong no matter which software does it that way, whether Microsoft or Apple or whoever.

One would expect open-source to be pioneering in supporting such scholarly use cases which may not be of commercial interest to big companies.

dscorbett commented 5 years ago

It’s suboptimal but not a bug. Mark order is somewhat arbitrary; more important is that everything (IMEs, fonts, shapers, renderers, Vedic parsing libraries, etc.) agree on the order. The order has been stable for years; changing it now would invalidate all past encoded data. Moreover, changing it in HarfBuzz would not fix other shapers, so some new data would continue to be created with the current order, so everything would have to support both orders, which is worse than the status quo.

I could see a reason to change it if Unicode explicitly says the current order is wrong, or if it turns out that most existing data already puts the visarga at the end.

jamadagni commented 5 years ago

Sorry but it is not clear to me what you expect the input order should be to get the desired rendering but avoid creation of a dotted circle. Can you please clarify?

dscorbett commented 5 years ago

The input order should be <U+092F, U+0903, U+0951>, i.e. the current order. The font should skip over the visarga when positioning the svarita relative to the base.
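To make the two orders concrete, here is a small sketch that rewrites text typed as tone mark + visarga into the visarga + tone mark order described above. It is only an illustration under stated assumptions: the mark set is limited to U+0951 and U+0952, and `to_enforced_order` is a hypothetical helper, not part of HarfBuzz or any input method.

```python
import re

TONE_MARKS = "\u0951\u0952"  # assumption: only these two svara marks
VISARGA = "\u0903"           # DEVANAGARI SIGN VISARGA


def to_enforced_order(text):
    """Move a visarga in front of the tone marks that precede it."""
    return re.sub(f"([{TONE_MARKS}]+)({VISARGA})", r"\2\1", text)


typed = "\u092F\u0951\u0903"  # ya + svarita + visarga, as a user might type it
stored = to_enforced_order(typed)
print([f"U+{ord(c):04X}" for c in stored])
# ['U+092F', 'U+0903', 'U+0951']
```

Text that is already in the visarga-first order passes through unchanged, so the function can be applied more than once without further effect.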

jamadagni commented 5 years ago

This is “current order” by what Unicode or other standard please?

dscorbett commented 5 years ago

It’s not a standard, I just mean the order currently enforced by DirectWrite and HarfBuzz.

behdad commented 5 years ago

“how is bug-for-bug compatibility with Microsoft software a good thing?”

Users expect the same font to display the same in all software. That requires sometimes doing bug-to-bug compatibility with Microsoft software.

You can read more about our philosophy re that at https://goo.gl/9eWCLy

jamadagni commented 5 years ago

@behdad Nice article. However I note the following two points in it:

1. “produce the same results as Uniscribe, unless Uniscribe behavior didn’t make sense / was bogus and we could do a better job.”

… and the immediately next line in it:

2. “A lot of the quirks in Uniscribe however, we decided to match, because we really want the same font and text combination to render the same on every platform”

If it is a question of minor positioning, I would allow that #2 applies to the present case. However, this is a question of a sequence being tagged as invalid by displaying the dotted circle. So I do think that #1 applies instead.

Anyhow, let me take this issue further “upstream”, i.e. to Unicode, so it can be fixed in both Microsoft and HB.

behdad commented 5 years ago

@jamadagni Oh I wasn't commenting about current issue. Just justifying our design decisions.

If there's something to be fixed in Unicode, that's even better.

behdad commented 3 years ago

@dscorbett is there anything actionable here?

dscorbett commented 3 years ago

I don’t think so, unless Unicode ever clarifies that the currently enforced order is wrong.

behdad commented 3 years ago

Thanks.

jamadagni commented 3 years ago

Have submitted http://www.unicode.org/L2/L2021/21054-svara-markers.pdf

vvasuki commented 2 years ago

“Have submitted http://www.unicode.org/L2/L2021/21054-svara-markers.pdf”

I'm not familiar with Unicode process but hear that it takes a long time. When is the decision about the above expected? (Affects transliteration software I maintain, which is used by some major corpus-maintainers.)

jamadagni commented 2 years ago

Sorry about this. I received the following feedback on 2021-Jun-29 but apparently omitted to update here:

Comments: We reviewed this document that identified a problem with the current encoding of Vedic text, specifically nonspacing svara marks (tone marks) and post-base markers (primarily visarga and anusvara in Bengali and South Indian scripts).

The problem is that users expect the sequence to be syllable + nonspacing svara mark(s) + the spacing mark. However, current text shaping engines mark this sequence as illegal and a dotted circle appears before the spacing mark; the permitted order is syllable + spacing mark + svara.

The author states that Vedic support is still in its infancy and requests TUS recommend svara marks be allowed before post-base visarga and anusvara.

The following points were raised:

The shaping engines agree on the behavior of svara markers and post-base spacing marks (i.e., syllable + spacing mark + svara) and they all follow the Core Spec (14.0 P460: R10) and follow the documentation of the OpenType Devanagari shaping engine.
Modern Input methods, such as Keyman, can re-order what users input to correct logical order.
It is not new for Brahmic scripts that the logical order differs from the visual order: left-side vowels are encoded post-base for most Brahmic scripts.
Making the change as proposed by Shriramana would mean the strings encoded in the old and new orders would not be canonically equivalent.
The Indic syllable structure needs to be investigated and clarified. The current documentation in the Unicode Standard and in OpenType shaping engine documentation is imprecise, inconsistent, and, for Vedic characters, incompatible with normalization. Members are invited to read N. Lindenberg’s Devanagari cluster validation document [L2/21-112](http://www.unicode.org/L2/L2021/21112-deva-cluster-valid.pdf).
If a proposal were to be considered, what would users expect if one of the svaras co-occurs with vowel signs or other signs above?

Recommendation: There was no consensus to make a change.
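
The point about canonical equivalence in the feedback above can be checked with Python's standard `unicodedata` module. This is a minimal sketch using one consonant, the svarita (U+0951) and the visarga (U+0903); `canonically_equivalent` is a hypothetical helper name. Because the visarga has canonical combining class 0, normalization does not reorder it relative to the tone mark, so the two orders remain distinct strings.

```python
import unicodedata

old_order = "\u092F\u0903\u0951"  # ya + visarga + svarita
new_order = "\u092F\u0951\u0903"  # ya + svarita + visarga


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(unicodedata.combining("\u0903"))  # 0: the visarga is a starter
print(unicodedata.combining("\u0951"))  # 230
print(canonically_equivalent(old_order, new_order))  # False
```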