Closed birchamp closed 4 years ago
Needs to be discussed with @klappy
@birchamp - rather than a string-punctuation-tokenizer issue, it looks like this should be a tC issue? Or is this tC-create?
@PhotoNomad0 If we address this in a common library, the applications will use normalization the same way and we avoid duplicating effort on a highly complex task. String normalization appears simple at first but quickly becomes a maintenance headache.
@klappy agreed. Do we have a plan on how to do this that won't bring the app performance to a crawl on slow machines? I'm guessing we don't have detail on how this would be implemented for Hebrew. The link looks very general. Are we mostly talking about Hebrew words where the accent characters are in different order? Or are there other cases?
My understanding is that there is a set list of these entities that are identical, and there may be a preferred rendered output that we want them normalized to.
The tokenizer has a new feature for normalization in the latest release to allow for custom normalization. We can continue working on the implementation and refining it as we have more use cases.
Oh, and for performance, normalization should only be run when absolutely necessary. Since our use case is the original languages, it should be easy to run it once on the Hebrew text prior to parsing it. That is far more efficient than running it on every verse, word, or token that gets rendered.
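A minimal sketch of the "run it once before parsing" idea, in Python rather than the tokenizer's own language. The `tokenize` function is a hypothetical whitespace splitter, not the string-punctuation-tokenizer API, and the choice of NFC is an assumption, since the thread has not settled on a form.

```python
import unicodedata


def normalize_once(text, form="NFC"):
    """Normalize the whole source text a single time, before any parsing."""
    return unicodedata.normalize(form, text)


def tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    return text.split()


# The same pointed letter typed two ways:
# shin + shin dot + dagesh, then shin + dagesh + shin dot.
raw = "\u05E9\u05C1\u05BC \u05E9\u05BC\u05C1"

# Normalization happens once, upstream of the tokenizer.
tokens = tokenize(normalize_once(raw))
tokens_match = tokens[0] == tokens[1]
```

Because the normalization is applied to the loaded text rather than inside the render path, every token that comes out of the parser is already in one form and can be compared with plain string equality.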
@klappy - makes sense, but requirements are still vague. Can you get that "set list of these entities" and attach it here? And it does seem strange that we are doing word normalization in a string tokenizer. Also, it would seem better to have a rule set than a list of identical entities, since a list could cause a lot of churn as they keep finding one more identity that they noticed.
Or perhaps I am misunderstanding what a "set list of these entities" entails.
Also, haven't read all comments here, but would doing any normalization on the DCS side through a git hook work at all so all apps are using the same data and get it massaged correctly? Go (the language DCS is written in) does support text normalization: https://blog.golang.org/normalization
https://unicode.org/faq/normalization.html#10
Q: But isn't there still a problem with Biblical Hebrew?
A: There was a problem, but it has been addressed. Because the Hebrew points are defined to have distinct combining classes, their character semantics is such that their ordering is immaterial in the standard. To handle those cases where visual ordering is material, see the discussion of the Combining Grapheme Joiner (CGJ) in Section 23.2, Layout Controls, in the Unicode Standard.
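The FAQ's point about distinct combining classes can be checked with a short sketch using Python's standard `unicodedata` module. The example letter and points (bet, qamats, dagesh) are chosen here for illustration and are not taken from the thread.

```python
import unicodedata

BET, QAMATS, DAGESH = "\u05D1", "\u05B8", "\u05BC"

# Each point has its own nonzero canonical combining class,
# so canonical ordering can always sort them deterministically.
classes = (unicodedata.combining(QAMATS), unicodedata.combining(DAGESH))

typed_a = BET + DAGESH + QAMATS  # dagesh typed first
typed_b = BET + QAMATS + DAGESH  # qamats typed first

# The raw strings differ, but normalization reorders the marks by
# combining class, so both inputs end up as the same sequence.
nfc_a = unicodedata.normalize("NFC", typed_a)
nfc_b = unicodedata.normalize("NFC", typed_b)
```

This only covers marks with distinct combining classes. Marks that share a class are not reordered by normalization, which is where the CGJ discussion in the Unicode Standard applies.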
article: a bit dated (but significant author)
only useful for "Appendix: Keyboard Charts" pp. 16-17
http://www.ntresources.com/blog/documents/Unicode4BibStudies.pdf
SBL: info on normalization & helpful examples / test cases (e.g., Ps 27:13); also includes "recommended mark ordering"
StackOverflow article seems to be related to the [old] Safari problem mentioned on the phone:
Screenshot of a normalization problem from a text of 1 Chr. 13:13. I have not verified which encoding UHT uses.
TODO: confirm with @klappy if Greek normalization is still required
My opinion is that yes, we need this for Greek too.
Based on https://unicode.org/reports/tr15/#Canonical_Equivalence and following. Note that: The outputs are not required to be identical, only canonically equivalent.
DoD: Opening a Hebrew text that is canonically equivalent to, but encoded differently from, another Hebrew text produces the same highlighting. Text UHB version 2.1.9
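One way to express "canonically equivalent" as a test, sketched in Python: two strings are canonically equivalent when their NFD forms are identical. The helper name `canonically_equivalent` and the Greek sample characters are illustrative assumptions, not something defined in this issue or in the tokenizer.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Three encodings of Greek alpha with an acute accent.
alpha_tonos = "\u03AC"      # GREEK SMALL LETTER ALPHA WITH TONOS
alpha_oxia = "\u1F71"       # GREEK SMALL LETTER ALPHA WITH OXIA
alpha_combining = "\u03B1\u0301"  # alpha + COMBINING ACUTE ACCENT

all_equivalent = (
    canonically_equivalent(alpha_tonos, alpha_oxia)
    and canonically_equivalent(alpha_tonos, alpha_combining)
)

# A plain alpha is a different character, not an equivalent encoding.
plain_differs = not canonically_equivalent("\u03B1", alpha_tonos)
```

A check like this compares encodings only. It does not require the two inputs to be byte-identical, which matches the note above that outputs need only be canonically equivalent.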
Needs a definition of what characters can be re-ordered. Consult @jag3773