NotoSansTamil : Incorrect rendering of ஶ்ரீ (U+0BB8 U+0BCD U+0BB0 U+0BC0)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Type / View the character ஶ்ரீ (U+0BB6 U+0BCD U+0BB0 U+0BC0)
2. Type / View the character sequence ஸ்ரீ (U+0BB8 U+0BCD U+0BB0 U+0BC0)

What is the expected output? What do you see instead?
1. For 1, the expected output is a single glyph (shritamil ஶ்ரீ). 
Actual output is separate glyphs of shaprehalftamil (ஶ்) and raiivowelsign 
(ரீ)  (ஶ் ரீ without a space in the middle)

2. For 2, the expected output is separate glyphs of saprehalftamil (ஸ்) and 
raiivowelsign (ரீ) (ஸ் ரீ without a space in the middle). Actual output is 
a single glyph (shritamil ஶ்ரீ)

What version of the product are you using? On what operating system?
Noto Sans Tamil 1.03 on Android Kitkat, Ubuntu.

Please provide any additional information below.

The complex glyph SRI was changed from U+0BB8 U+0BCD U+0BB0 U+0BC0 to U+0BB6 
U+0BCD U+0BB0 U+0BC0 in Unicode 4.1.[1] NotoSansTamil (like many other fonts) 
is still using the older definition for the complex glyph. Apple / iOS use 
updated fonts; as a result, text written in the old format (and rendered as a 
single shri glyph in the NotoSansTamil font) is rendered there as separate 
glyphs, and vice versa. This defeats the purpose of interoperability that 
Unicode gives.
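
To make the two sequences unambiguous regardless of how a given font draws them, here is a minimal Python sketch that spells them out as escapes and prints the code point names. The constant and function names (`SHA_SEQUENCE`, `SA_SEQUENCE`, `describe`) are made up for this illustration and are not part of any font tooling.

```python
import unicodedata

# The two sequences discussed in this report, written as escapes so the
# code points do not depend on how the glyphs happen to render.
SHA_SEQUENCE = "\u0BB6\u0BCD\u0BB0\u0BC0"  # SHA + VIRAMA + RA + II (updated encoding per this report)
SA_SEQUENCE = "\u0BB8\u0BCD\u0BB0\u0BC0"   # SA + VIRAMA + RA + II (older encoding per this report)

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, seq in (("SHA", SHA_SEQUENCE), ("SA", SA_SEQUENCE)):
    print(label, describe(seq))
```

The two strings differ only in their first code point, which is why a font that keys the ligature on the wrong letter swaps the rendering of the two.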

Suggested Fix :-
The ligature subtable for shritamil glyph currently has
saprehalftamil ratamil(ர) iivowelsigntamil(ீ)
satamil(ஸ) viramatamil(்) ratamil(ர) iivowelsigntamil(ீ)

It should be replaced with 

shaprehalftamil ratamil(ர) iivowelsigntamil(ீ)
shatamil(ஶ) viramatamil(்) ratamil(ர) iivowelsigntamil(ீ)

This change will be in line with the latest Unicode standard.
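
The effect of the suggested swap can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: it is not how a shaping engine reads the font's GSUB table, it models just the four-glyph rule (the pre-half-form rule is left out), and `OLD_RULES`, `NEW_RULES` and `apply_ligatures` are hypothetical names. The glyph names are the ones quoted above.

```python
# Toy model of a ligature lookup: each rule maps a sequence of glyph names
# to one ligature glyph. Not a real GSUB implementation.
OLD_RULES = {
    ("satamil", "viramatamil", "ratamil", "iivowelsigntamil"): "shritamil",
}
NEW_RULES = {
    ("shatamil", "viramatamil", "ratamil", "iivowelsigntamil"): "shritamil",
}

def apply_ligatures(glyphs, rules):
    """Scan left to right and replace each rule match, longest rule first."""
    ordered = sorted(rules.items(), key=lambda item: -len(item[0]))
    out, i = [], 0
    while i < len(glyphs):
        for components, ligature in ordered:
            if tuple(glyphs[i:i + len(components)]) == components:
                out.append(ligature)
                i += len(components)
                break
        else:
            # No rule matched at this position: keep the glyph as-is.
            out.append(glyphs[i])
            i += 1
    return out

sha_run = ["shatamil", "viramatamil", "ratamil", "iivowelsigntamil"]
sa_run = ["satamil", "viramatamil", "ratamil", "iivowelsigntamil"]

print(apply_ligatures(sha_run, OLD_RULES))  # four separate glyphs
print(apply_ligatures(sha_run, NEW_RULES))  # ['shritamil']
print(apply_ligatures(sa_run, NEW_RULES))   # four separate glyphs
```

With the old rules the SHA run stays as separate glyphs and the SA run collapses into `shritamil`; with the new rules the behaviour is reversed, which is the outcome requested in this report.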

See also https://bugzilla.redhat.com/show_bug.cgi?id=1078661 similar bug for 
Lohit-Tamil font.

[1] http://www.unicode.org/L2/L2005/05129-tamil-named.txt

Original issue reported on code.google.com by srik....@gmail.com on 23 Mar 2014 at 8:14

Attachments:

[SHRI Screenshot.png](https://storage.googleapis.com/google-code-attachments/noto/issue-23/comment-0/SHRI Screenshot.png)

GoogleCodeExporter commented 9 years ago

Thanks a lot for the report. I agree that SHA+VIRAMA+RA+II should form a single 
ligature. I filed that internally as noto-alpha/192.

For SA+VIRAMA+RA+II, it's not clear to me what Unicode has decided (if it 
should only be displayed with a visible pulli, or only with a ligature, or both 
are acceptable). The reference you provided (L2/05-129) doesn't say anything 
about that sequence, and I could not arrive at a conclusion from reading 
section 9.6 of Unicode Core Specification, version 6.2. It appears to me that 
all three Tamil fonts in Windows 8.1 render SA+VIRAMA+RA+II as the same 
ligature.

Would you please point us to a UTC decision or Unicode text where it says or 
implies that SA+VIRAMA+RA+II should be rendered with a visible pulli?

Original comment by roozbeh@google.com on 2 Apr 2014 at 11:30

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

http://www.unicode.org/faq/tamil.html#12 says that 'mapping should be 
**updated** from <U+0BB8, U+0BCD, U+0BB0, U+0BC0> to <U+0BB6, U+0BCD, U+0BB0, 
U+0BC0>' instead of saying something like 'a new mapping should be added from 
<U+0BB6, U+0BCD, U+0BB0, U+0BC0>' or 'along with the old mapping, a new mapping 
has to be added' which can very well imply that SA+VIRAMA+RA+II should be 
rendered in non conjunct form with visible pulli.

Linguistically SHRI is a character, and having dual encoding does more harm 
(affects search etc.) than good (compatibility's sake). SA+VIRAMA+RA+II was not 
the equivalent of SHRI, which is why a new character SSA (U+0BB8) got 
introduced and the definition was **updated**. I wasn't aware of what Windows 
did, but if they too render the complex glyph, that's a bug again.

Original comment by srik....@gmail.com on 6 Apr 2014 at 5:07

GoogleCodeExporter commented 9 years ago

I concur with Srikanth. If fonts continue to display non-standard sequences 
like this, then as Srikanth says the interoperability purpose of the standard 
is lost.

Consider Arabic/Urdu-based names like tasrīn. In Tamil script they should be 
written as தஸ்‌ரீன் (தஸ்.ரீன் without the dot) 
but with the current behaviour they are displayed identical to 
தஶ்ரீன் whereas ஶ்ரீ is only ever found in Sanskrit-based 
names. (On Firefox 28 on my Kubuntu Saucy system I am able to prevent the 
ligature by using ZWNJ but that should not be required for normal usage.)
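
For reference, the ZWNJ workaround mentioned above can be sketched in Python as follows. The helper name `block_sri_ligature` is hypothetical, and whether the joiner actually suppresses the ligature depends on the font and shaping engine in use; this only shows where the character goes in the text.

```python
ZWNJ = "\u200C"             # ZERO WIDTH NON-JOINER
SA_VIRAMA = "\u0BB8\u0BCD"  # SA + VIRAMA
RA_II = "\u0BB0\u0BC0"      # RA + II

def block_sri_ligature(text):
    """Insert ZWNJ between SA+VIRAMA and RA+II wherever they are adjacent."""
    return text.replace(SA_VIRAMA + RA_II, SA_VIRAMA + ZWNJ + RA_II)

# TA, SA, VIRAMA, RA, II, NNNA, VIRAMA: the name from the example above.
name = "\u0BA4\u0BB8\u0BCD\u0BB0\u0BC0\u0BA9\u0BCD"
fixed = block_sri_ligature(name)
print([f"U+{ord(ch):04X}" for ch in fixed])
```

Applying the function twice gives the same result as applying it once, since the SA+VIRAMA and RA+II pairs are no longer adjacent after the first pass.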

Original comment by samj...@gmail.com on 6 Apr 2014 at 5:58

GoogleCodeExporter commented 9 years ago

Thanks a lot for the examples and the discussion.

It appears that SA+VIRAMA+RA+II is very commonly used for "sri/shri" on the web 
(compare Google search results for both sequences), including on the title page 
of the Tamil Wikipedia article about the ligature: 
http://ta.wikipedia.org/s/14u0

I'm following the SA+VIRAMA+RA+II issue up with the Unicode Technical 
Committee, and will bring it up at our next meeting in early May, with a 
pointer to the discussion here.

Original comment by roozbeh@google.com on 17 Apr 2014 at 1:47

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Thank you very much for taking this up with the UTC. If you look at the same 
Wikipedia page about the origin of the character, it says U+0BB6 is its root. 
The comparison of Google search results will have an inherent bias towards the 
old sequence, since most fonts / input tools did not adopt the new encoding. 
The problem has become visible of late (for more than a couple of years now) 
since Apple adopted the latest standard, hence causing fragmentation. 

Erratum for Comment #2: 'why a new character SSA (U+0BB8)' should be read as 
'why a new character SSA (U+0BB6)'

Original comment by srik....@gmail.com on 19 Apr 2014 at 7:13

GoogleCodeExporter commented 9 years ago

The bug is fixed in r245. I also got an action item from the UTC to write a 
proposal about the problem: http://www.unicode.org/L2/L2014/14100.htm#139-A37

Original comment by roozbeh@google.com on 16 May 2014 at 1:11

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Hello Roozbeh. Please can you elaborate what was decided in UTC? The AI doesn't 
explain that.

Original comment by jamada...@gmail.com on 16 May 2014 at 1:49

GoogleCodeExporter commented 9 years ago

Nothing was decided. I was told to come up with a proposal that tells what 
exactly needs to be changed in which parts of the standard. UTC will decide 
what to do when they see the proposal.

Original comment by roozbeh@google.com on 16 May 2014 at 2:01

GoogleCodeExporter commented 9 years ago

Then what exactly was "fixed in r245"?

Original comment by samj...@gmail.com on 16 May 2014 at 2:17

GoogleCodeExporter commented 9 years ago

SHA+VIRAMA+RA+II now forms a ligature.

Original comment by roozbeh@google.com on 16 May 2014 at 4:37

GoogleCodeExporter commented 9 years ago

Hello Roozbeh, can you please file a bug for, and fix, the same problem with 
Droid Sans Tamil too? (Sorry for putting it on you but I'm really bogged down 
here, which is why I didn't call in to the UTC either.) Thanks.

Original comment by jamada...@gmail.com on 17 May 2014 at 5:01

GoogleCodeExporter commented 9 years ago

Droid Sans Tamil is no longer supported. Only Noto is supported.

Original comment by roozbeh@google.com on 17 May 2014 at 6:01

GoogleCodeExporter commented 9 years ago

Thanks for following this up. I am unaware of the Unicode process, but if it's 
okay, can you please share your proposal when it's ready, so that we can give 
feedback on it before it gets discussed in UTC? Thanks

Original comment by srik....@gmail.com on 17 May 2014 at 12:35
