liblouis / liblouis

Open-source braille translator and back-translator.
http://liblouis.io
GNU Lesser General Public License v2.1

glyph vs diaeresis #98

Open egli opened 9 years ago

egli commented 9 years ago

Simon Aittamaa says:

I realize that Michael Gray has already answered this question, but I'll expand on it a bit since his answer might have been overlooked...

As mentioned by Michael Gray, this problem arises from the use of combining characters, see [1] and [2]. The letter 'ö' can either be represented as a single precomposed glyph (0x00f6) or as 'o' (0x006f) followed by a combining diaeresis/umlaut '¨' (0x0308).

Most systems, e.g. Linux and Windows, use the precomposed glyph (at least for 'ö'), while OSX uses the combination of 'o' and '¨', which is why you get the stray \x0308 in your result. There might be libraries that can combine such Unicode sequences into single glyphs (libiconv?), but I haven't looked into that.

It might be that we have to decompose/normalize [3] all characters, e.g. expand 'ö' into the canonical form ('o', '¨'), in order to avoid this problem completely. However, this would require moving away from the current approach where each character is a single uint16_t/uint32_t.

IMHO, the latter approach, i.e. using the normalized/canonical form, would be better, since it is the means of checking for equivalence prescribed by the standard, but it would require a major overhaul of how characters/strings are handled in liblouis (which in itself might be a good thing?).
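For illustration, here is a minimal sketch (plain Python using the standard unicodedata module, not liblouis code) of the two representations of 'ö' and how canonical normalization converts between them:

```python
import unicodedata

composed = "\u00f6"      # 'ö' as a single precomposed code point
decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS

print(composed == decomposed)                                # False: different code point sequences
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: canonical decomposition
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: canonical composition
```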

Arend Arends wrote:

In principle Liblouis and the tables could handle both forms (single glyphs and combinations of a character and a diacritic symbol). It seems that currently both Liblouis and most tables handle only the single-glyph forms, so the most practical approach would be to provide an extra pass that converts strings with diacritic symbols.

Paul Wood wrote:

Is it possible to add this library and 'normalise' the UTF-8 characters? http://julialang.org/utf8proc/

egli commented 9 years ago

Aaron Cannon writes:

Another thing to be aware of is that sometimes there is no single Unicode codepoint for representing a character. So while you can compose the 'o' and acute codepoints into a single o-acute codepoint, this is not always possible. This is not likely to affect much western writing, but it is possible, and it definitely will impact supporting the IPA braille code. This is why I've been unable to finish my work on the IPA table I started a while back.

So, my recommendation would be to support decomposed characters by default, and for convenience, all characters in tables and input should be decomposed automagically.
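To make the first point concrete, a small illustration (again plain Python with unicodedata, not liblouis code): 'o' plus a combining diaeresis composes to a single codepoint under NFC, but 'n' plus a combining diaeresis has no precomposed form and stays as two codepoints:

```python
import unicodedata

print(len(unicodedata.normalize("NFC", "o\u0308")))   # 1 -> composes to U+00F6
print(len(unicodedata.normalize("NFC", "n\u0308")))   # 2 -> no precomposed 'n with diaeresis' exists
```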

And one other thing is that we'll need a way of handling when the character modifiers come before the letter they modify, as they do in UEB. In other words, if you write a c with a circumflex, in decomposed Unicode, it's a c codepoint, followed by a combining circumflex codepoint. But, in UEB, the circumflex indicator comes before the c.

I recommend we have a generic way of specifying the sign for the circumflex, and other similar modifiers, rather than trying to anticipate all the possible combinations folks might want to use.
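A hypothetical sketch of the reordering Aaron describes: after NFD decomposition a combining mark follows its base letter, but a UEB-style modifier indicator has to be emitted before that letter. The indicator names below are invented placeholders for illustration; real dot patterns would come from the braille table.

```python
import unicodedata

# Hypothetical placeholders; a real table would map these to UEB dot patterns.
MODIFIER_INDICATOR = {
    "\u0302": "<circumflex>",   # COMBINING CIRCUMFLEX ACCENT
    "\u0308": "<diaeresis>",    # COMBINING DIAERESIS
}

def reorder_modifiers(text: str) -> list[str]:
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Move the modifier's indicator in front of the base letter it follows.
            out.insert(len(out) - 1, MODIFIER_INDICATOR.get(ch, ch))
        else:
            out.append(ch)
    return out

print(reorder_modifiers("ĉ"))   # ['<circumflex>', 'c']
```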

egli commented 9 years ago

The NFC FAQ seems relevant here.

bertfrees commented 9 years ago

Thanks. There is also NFD, which decomposes everything.

I still don't quite understand whether this can be solved by using the proper encoding or not. Liblouis doesn't have to do the normalisation itself; it can require the applications that call Liblouis to do that, but it does need to know which characters to treat as one sign. An encoding that makes that easy would help a lot.

An alternative is to allow any sequence of characters in character definitions.

bertfrees commented 5 years ago

I think, as a first step at least, two tables that implement NFC and NFD with correct rules would be useful. Then a table can choose which one it includes and represent the characters accordingly in its rules. The most logical would be to do NFC, because character definitions are currently a single widechar. But it is also possible (although slightly more challenging) to write rules for NFD-decomposed characters. According to Aaron it is sometimes even required, because some characters cannot be represented by a single codepoint. And in some cases it even makes more sense to handle the components as separate characters.

Later it might be useful to replace the NFC/NFD table with C code and add a mode or opcode to control it.

Added the "good first issue" label, although I'm not sure it is so trivial to implement this in a table. It might be easier to do it directly in C.

I don't think we necessarily need a major overhaul of how characters are handled in Liblouis. If needed we can consider allowing more than one "character" (character component) in character definitions, and this would indeed be quite a big change, but we should be able to work around the limitations to some extent at least. So I would say let's wait and see.

Also related is the whole discussion about UTF-16 surrogate pairs.
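As a brief illustration of that related issue: a character outside the Basic Multilingual Plane is a single code point but two UTF-16 code units, so a liblouis build with a 16-bit widechar sees it as two units. A quick check (plain Python, for illustration only):

```python
symbol = "\U0001D11E"                                       # MUSICAL SYMBOL G CLEF
utf16 = symbol.encode("utf-16-be")
units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(units)                                                # ['d834', 'dd1e'] -> high and low surrogate
```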

MikeGray-APH commented 5 years ago

One issue with dealing with NFC/NFD is capitalization. For example, in UEB the letter modifiers must be before any capital indicator.

bertfrees commented 5 years ago

Hi @MikeGray-APH, this doesn't seem to be confirmed by the tests? Or am I looking at the wrong thing? Also, is this really an issue in the table, i.e. does the table not support it, or is the way it was implemented problematic?

MikeGray-APH commented 5 years ago

Letter modifiers require special treatment when it comes to the placement of indicators (markEmphases), so the implementation would have to be modified. Also, modifiers probably will need their own character attribute.

bertfrees commented 5 years ago

Shouldn't this expected behavior be covered by the tests? I think I even found examples in the tests that contradict it (Étude, Voyage À Nice, etc.): the capital sign comes first in these tests.

What about the "move after capital sign" comment in the "Modifiers" section?

MikeGray-APH commented 5 years ago

The tests are using the precomposed Unicode characters. When using the combining characters the capital sign is in the wrong place.

bertfrees commented 5 years ago

You said before that

[...] in UEB the letter modifiers must be before any capital indicator.

Everything indicates that you made a mistake in that sentence; that would explain a lot. The other way around makes more sense.

It would be good to have "decomposed" versions of these tests; that would make it more obvious that there is an issue.

But yes, to get back to the subject, capitalisation is indeed an issue. Native support for modifiers, i.e. being able to define characters with multiple components, or having a special character category for modifiers as you suggest, would definitely make things much easier, but I think a lot is possible already today. The pass2 workaround that you did for the tilde could be applied to all accents, I would think. And even for sequences of capitalized letters there might be a way. Again, some more tests would be helpful, so that somebody like myself could experiment with the table and try to get these tests to work.

MikeGray-APH commented 5 years ago

When I said "before", it was relative to the position of the letter being modified. If a modified letter needs a capital indicator, then that modifier must come before the capital indicator relative to the position of that letter. I see now how this wasn't clear: I was talking about relative positions, not absolute positions.

You are correct in that what I did for the tilde could be applied to all modifiers. It won't work when multiple combining characters are used, however. I don't remember why only the tilde got fixed. Looking back, a more solid solution would have been to mark the modifiers with an attribute that the markEmphases function could then check for, and adjust the placement of the indicators accordingly.

bertfrees commented 5 years ago

Yes, that's indeed one solution, but it needs to be thought through a bit more. It would be most logical if the character with attribute "modifier" were connected to the character before it, but what if you switch them in the correct pass like you do in the UEB table? One solution would be to change these correct rules into context rules and make it possible to perform certain translations, which now happen in pass 1, in pass 2, i.e. after the emphasis processing. This is a direction in which I've been wanting to go anyway. Alternatively we could somehow make it possible to split the emphasis handling over multiple passes, i.e. do the analysis at the beginning of one pass, and the insertion of indicators at the end of another pass.

LeonarddeR commented 5 months ago

@bertfrees wrote:

Liblouis doesn't have to do the normalisation, it can require the applications that call Liblouis to do that...

I'm investigating normalization for https://github.com/nvaccess/nvda/issues/16466. There's a major problem with expecting the screen reader to do the normalization first, namely that the raw-to-braille and braille-to-raw positions then correspond to the normalized input, not the raw input. For example, with NFKC the ligature ĳ (U+0133) normalizes to the two characters "ij". When providing "ij" to liblouis, it treats the input as two characters, while the ligature occupies only one character in the raw input. In NVDA this causes an off-by-one error for every normalization that decomposes a character. Of course the opposite would apply to NFD.
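A simplified sketch (not NVDA's actual code) of the extra bookkeeping this requires: when the caller normalizes before handing text to liblouis, it can record, for every position in the normalized string, which raw character it came from, and later use that to translate the positions liblouis reports back to the raw input. Normalizing character by character, as below, is an approximation; sequences that only change across character boundaries need extra care.

```python
import unicodedata

def normalize_with_map(raw: str, form: str = "NFKC"):
    normalized = []
    norm_to_raw = []                      # normalized index -> raw index
    for raw_index, ch in enumerate(raw):
        for out_ch in unicodedata.normalize(form, ch):
            normalized.append(out_ch)
            norm_to_raw.append(raw_index)
    return "".join(normalized), norm_to_raw

text, mapping = normalize_with_map("\u0133sselmeer")   # ligature 'ĳ' + 'sselmeer'
print(text)      # 'ijsselmeer' -> one character longer than the raw input
print(mapping)   # [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
```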

bertfrees commented 4 months ago

@LeonarddeR Hmm, good point. You would need the mapping from the string normalization step too and combine the two mappings. Did you find a solution?

LeonarddeR commented 4 months ago

Yes, I have found one. The result is in https://github.com/nvaccess/nvda/pull/16521. In fact, I used a mapping as you suggested.
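To spell out the combination suggested above, here is a hypothetical continuation of the earlier sketch (not the code from the NVDA pull request): the positions liblouis reports for the normalized text, e.g. via the inputPos array of lou_translate, are passed through the normalization mapping so that they point back into the raw, unnormalized input.

```python
def compose_positions(braille_to_norm, norm_to_raw):
    # braille_to_norm: for each braille cell, the index in the normalized text
    # (as reported by liblouis); norm_to_raw: the mapping built by
    # normalize_with_map above.
    return [norm_to_raw[i] for i in braille_to_norm]

# If cells 0 and 1 came from normalized positions 0 and 1 ('i' and 'j'),
# both map back to raw position 0, the ligature character.
print(compose_positions([0, 1, 2], [0, 0, 1]))   # [0, 0, 1]
```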