koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Normalize utf8 #466

Closed zwim closed 2 years ago

zwim commented 2 years ago

An open question still exists: Shall we use NFC or NFKC?

The other open question is: Will this break things? (Although I did not find any quirks in a quick check.)

Companion for https://github.com/koreader/koreader/pull/8863



zwim commented 2 years ago

I would tend to use NFKC everywhere (and drop NFC) as utf8 casefold uses that too.
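For illustration, a minimal sketch (plain C++ against the utf8proc library this normalization relies on; not crengine code) of the difference: both forms compose decomposed accents, but only NFKC also folds compatibility characters such as ligatures, which actually changes the text:

```cpp
// Minimal NFC vs NFKC comparison using utf8proc (assumed available).
#include <utf8proc.h>
#include <cstdio>
#include <cstdlib>

static void show(const char *label, const char *s) {
    utf8proc_uint8_t *nfc  = utf8proc_NFC((const utf8proc_uint8_t *)s);
    utf8proc_uint8_t *nfkc = utf8proc_NFKC((const utf8proc_uint8_t *)s);
    printf("%s: NFC=\"%s\" NFKC=\"%s\"\n", label, (char *)nfc, (char *)nfkc);
    free(nfc);  // utf8proc returns malloc'ed buffers
    free(nfkc);
}

int main() {
    // e + combining acute (U+0301): both forms compose it to U+00E9 "é"
    show("e + U+0301", "e\xCC\x81");
    // "fi" ligature (U+FB01): NFC keeps it as-is, NFKC folds it to plain "fi" -
    // a real text change (different glyphs, widths, search/highlight offsets)
    show("U+FB01", "\xEF\xAC\x81");
    return 0;
}
```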

poire-z commented 2 years ago

It might break past highlights & last position: xpointers store indices into the node text, so this might cause some shifts, or indices overflowing the normalized/smaller text node and not being found. I'm afraid we can't migrate these xpointers (we did some past migrations where there were just wrapping nodes added or removed, which didn't change text indices).

So, at best it needs to depend on some updated DOM_VERSION_CURRENT (top of lvtinydom.cpp) - but I don't know/remember if upgrades are allowed between versions, or if we're OK killing/losing everything when the DOM version is manually updated.

Also, by how much does this increase loading/parsing time on large books? (Remove the cache between runs.) Is there no other solution that does this normalization later, when needed?
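To make the xpointer concern concrete, a minimal sketch (plain UTF-8 with utf8proc, not the actual crengine node text handling): composition shrinks decomposed text, so saved indices can shift or overflow:

```cpp
// Counting code points before/after NFC to show the index shift (sketch only).
#include <utf8proc.h>
#include <cstdio>
#include <cstdlib>

// Count UTF-8 code points by counting lead bytes.
static size_t cpCount(const utf8proc_uint8_t *s) {
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xC0) != 0x80) n++;
    return n;
}

int main() {
    // "déjà" with both accents decomposed: 6 code points
    const char *node = "de\xCC\x81ja\xCC\x80";
    utf8proc_uint8_t *norm = utf8proc_NFC((const utf8proc_uint8_t *)node);
    printf("before: %zu code points, after NFC: %zu\n",
           cpCount((const utf8proc_uint8_t *)node), cpCount(norm)); // 6 -> 4
    // An xpointer offset of 5 saved against the old text now overflows the
    // node, and offsets 2..4 silently point at different characters.
    free(norm);
    return 0;
}
```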

poire-z commented 2 years ago

As for NFC vs NFKC, you might want to look at what Harfbuzz prefers, or how it does with each.

Another thought: when not using Harfbuzz (using our Kerning off/fast/good), we map one unicode char from the source to one freetype glyph. It's possible that some books ship with embedded fonts tailored to contain only the glyphs for the unicode chars present in the book (dunno if that is what is called "font subsetting" in numerous Harfbuzz discussions) - so, by doing this normalization, we might break this case.
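A minimal sketch of that risk against the FreeType API (the subset font described in the comments is hypothetical): in the one-char-one-glyph path, a composed code point introduced by normalization may simply have no glyph:

```cpp
// FreeType glyph presence check (sketch; 'face' comes from FT_New_Face).
#include <ft2build.h>
#include FT_FREETYPE_H

// FT_Get_Char_Index() returns 0 when the face has no glyph for the code point.
static bool hasGlyph(FT_Face face, FT_ULong codepoint) {
    return FT_Get_Char_Index(face, codepoint) != 0;
}

// A font subsetted for a book with decomposed text could contain only:
//   hasGlyph(face, 0x0065)  // "e"             -> true
//   hasGlyph(face, 0x0301)  // combining acute -> true
//   hasGlyph(face, 0x00E9)  // composed "é"    -> false: the normalized text
//                           //   would render a .notdef box
```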

zwim commented 2 years ago

> It's possible that some books ship with embedded fonts tailored to contain only the glyphs for the unicode chars present in the book (dunno if that is what is called "font subsetting" in numerous Harfbuzz discussions) - so, by doing this normalization, we might break this case.

That's right, so I will put the normalization into the hyphenation code.

zwim commented 2 years ago

Now the normalization is done in the user dict hyphenate method; it works for me with Greek words from user-Greek (which are not normalized).

A thought came up: what if ichnilatis-gr's hyphenation problems also come from using a normalized pattern file with non-normalized books? At least one of his books has non-normalized strings (that is the only way I can think of for non-normalized words to end up in the user dictionary). If that is true, we should put the normalization at the beginning of TexHyph::hyphenate (then all hyphenation is done with normalized strings).

poire-z commented 2 years ago

> works for me with Greek words from user-Greek (which are not normalized).

But do we really want to accept non-normalized user hyph dicts? Didn't I get it right when you said in the other issue that you put correctly normalized words in the user hyph dicts? In which case this is not useful. (Unless you meant we just don't care: we put the words as we get them from the book? Which means users could have multiple entries for the same word, depending on whether it came from a book with normalized text or from a non-normalized one?)

> What if ichnilatis-gr's hyphenation problems also come from using a normalized pattern file with non-normalized books? At least one of his books has non-normalized strings (that is the only way I can think of for non-normalized words to end up in the user dictionary). If that is true, we should put the normalization at the beginning of TexHyph::hyphenate (then all hyphenation is done with normalized strings).

Indeed, the problem is more general. If all our patterns are normalized, and if you make user hyph patterns also normalized, you would be just fine on normalized books. And we are already failing on non-normalized books (which must be rare, so I wouldn't mind hyphenation not working on such books - it's already the case, and nobody ever complained :) so they must really be rare). Or do some/all of our pattern files have patterns in both normalized and non-normalized forms? Or are they targeted towards only one form? I have no idea :)

But if we wanted to support these non-normalized books, indeed, you should do this early in all hyphenate() methods. But the issue is that hyphenate() should return a map of flags by original indices - because it's lvtextfm.cpp that does the word cutting from these flags on the (non-modified/non-normalized) original source text. If you do some normalization with the utf8proc function, you have no idea where the normalization shifts happened, so you can't correctly put/correct flags on the original indices. (We do correct this for soft hyphens, which was painful to get right :) but for those, we know how much shifting we are doing - which I guess we can't know with utf8proc...)

zwim commented 2 years ago

> works for me with Greek words from user-Greek (which are not normalized).

These are the words from the original user-Greek.hyph copied to a *.txt file, so that I can play with the original words (remember: until now, all the words in our user dictionary are the same as in the book, normalized or not).

> But do we really want to accept non-normalized user hyph dicts?

It does not matter. What matters is that the user hyphenation algorithm uses a binary search.
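To illustrate (a minimal sketch with a plain sorted std::vector, not the actual crengine user-dict structure): a byte-wise binary search can never match a decomposed lookup key against composed entries, so both sides must be in the same form:

```cpp
// Why the user-dict binary search needs one consistent normalization form.
#include <utf8proc.h>
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
    // User dict entries, stored NFC-normalized and sorted byte-wise.
    std::vector<std::string> dict = { "bin\xC3\xB4me", "caf\xC3\xA9" };
    std::sort(dict.begin(), dict.end());

    std::string fromBook = "cafe\xCC\x81"; // "café", decomposed in the book
    bool found = std::binary_search(dict.begin(), dict.end(), fromBook);
    // found == false: the byte comparison never sees the two as equal.

    utf8proc_uint8_t *norm =
        utf8proc_NFC((const utf8proc_uint8_t *)fromBook.c_str());
    found = std::binary_search(dict.begin(), dict.end(),
                               std::string((char *)norm));
    free(norm);
    return found ? 0 : 1; // found == true once the key is normalized too
}
```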

> Didn't I get it right when you said in the other issue that you put correctly normalized words in the user hyph dicts? In which case this is not useful.

It's not useful in the case that you cannot hyphenate books with non-normalized words.

> (Unless you meant we just don't care: we put the words as we get them from the book? Which means users could have multiple entries for the same word, depending on whether it came from a book with normalized text or from a non-normalized one?)

Yes, that's the way to go. We allow users to have multiple entries (normalized or not). For that we would just need to rewrite our frontend lowercase function. No mess or side effects in crengine.

poire-z commented 2 years ago

> No mess or side effects in crengine.

I'm all for this :) but...

> Yes, that's the way to go. We allow users to have multiple entries (normalized or not). For that we would just need to rewrite our frontend lowercase function.

So, normal Greek pattern hyphenation will not work on a non-normalized Greek book (or only plain words will work). But you will allow a user to add non-normalized words to his user-hyphenation dict, and these will then be the only words that get a chance to be hyphenated - so that user may end up (after a few years :) rewriting the whole greek.pattern in a non-normalized way :)

I'm not rushing to get all this fixed in a better way, just throwing some ideas:

* if we knew our *.pattern files are all NFC, or all NFKC, we could require all of them to be in that specific form - and we would also make it so that user hyph dicts are in that specific form.
* in the hyphenate() code, maybe we could check if the candidate word is in that form, and if not, normalize it into that form while trying to build an index translation table, by normalizing progressively parts of the word?

For example, the (misspelled) French word bïnôme: decomposed, we could find it in a book as b i ¨ n o ^ m e (base letters followed by combining marks).

```
b        => b
bi       => bi
bi¨      => bï      (new index shift 3=>2)
bi¨n     => bïn     (same index shift 4=>3)
bi¨no    => bïno    (same index shift 5=>4)
bi¨no^   => bïnô    (new index shift 6=>4)
bi¨no^m  => bïnôm   (same index shift 7=>5)
bi¨no^me => bïnôme  (same index shift 8=>6)
```

We could then get bïnôme to hyphenate at position 2 (bï-nôme), and we would translate the flags back so the break ends up allowed at position 3 (bi¨-no^me).
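A minimal sketch of that index translation table (plain UTF-8 with utf8proc rather than crengine's lString32; the helper names are made up): normalize ever-longer prefixes, record where each original index lands, and map normalized break positions back to the last original index that reaches them:

```cpp
// Progressive-prefix normalization with an index translation table (sketch).
#include <utf8proc.h>
#include <cstdlib>
#include <string>
#include <vector>

// Count UTF-8 code points by counting lead bytes.
static size_t cpCount(const utf8proc_uint8_t *s) {
    size_t n = 0;
    for (; *s; s++)
        if ((*s & 0xC0) != 0x80) n++;
    return n;
}

// map[i] = code points in NFC(first i+1 original code points).
// For b i ¨ n o ^ m e this gives 1,2,2,3,4,4,5,6. O(n^2), but hyphenation
// candidates are single words, so n stays small.
static std::vector<size_t> buildShiftTable(const std::string &word) {
    std::vector<size_t> map;
    size_t end = 0;
    while (end < word.size()) {
        do { end++; } while (end < word.size() &&
                             ((unsigned char)word[end] & 0xC0) == 0x80);
        std::string prefix = word.substr(0, end);
        utf8proc_uint8_t *norm =
            utf8proc_NFC((const utf8proc_uint8_t *)prefix.c_str());
        if (!norm) return {}; // invalid UTF-8: caller skips normalization
        map.push_back(cpCount(norm));
        free(norm);
    }
    return map;
}

// A break after normalized position p maps back to the *last* original index
// reaching that prefix length: bï-nôme at 2 becomes bi¨-no^me at 3, i.e.
// past the combining diaeresis.
static size_t toOriginalBreak(const std::vector<size_t> &map, size_t p) {
    size_t orig = 0;
    for (size_t i = 0; i < map.size(); i++)
        if (map[i] == p) orig = i + 1;
    return orig;
}
```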

zwim commented 2 years ago

> So, normal Greek pattern hyphenation will not work on a non-normalized Greek book (or only plain words will work). But you will allow a user to add non-normalized words to his user-hyphenation dict, and these will then be the only words that get a chance to be hyphenated - so that user may end up (after a few years :) rewriting the whole greek.pattern in a non-normalized way :)

That would be my proposal for now.

Maybe a message could give the user a hint that he is reading a book with non-normalized data -> if he does not want to re-click the whole hyph patterns, he can normalize the book (with calibre(?)) and use an appropriate font.

> I'm not rushing to get all this fixed in a better way, just throwing some ideas:

> * if we knew our *.pattern files are all NFC, or all NFKC, we could require all of them to be in that specific form - and we would also make it so that user hyph dicts are in that specific form.
> * in the hyphenate() code, maybe we could check if the candidate word is in that form, and if not, normalize it into that form while trying to build an index translation table, by normalizing progressively parts of the word?

That could be a midterm goal, and would be a proper solution. Once this is done, we can easily drop the non-normalized user dictionary.