unfoldingWord / string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation
https://string-punctuation-tokenizer.netlify.app/#/Tokenize
MIT License

SPIKE: Normalize Hebrew strings for canonical equivalence according to Unicode standard #46

Closed birchamp closed 4 years ago

birchamp commented 4 years ago

Based on https://unicode.org/reports/tr15/#Canonical_Equivalence and the sections that follow. Note that the outputs are not required to be identical, only canonically equivalent.
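To make "canonically equivalent" concrete, here is a minimal sketch using Python's standard `unicodedata` module (not this library's own code). The helper name `canonically_equivalent` and the sample letter are assumptions chosen for illustration; the check relies on the UAX #15 rule that two strings are canonically equivalent when their NFD forms are identical.

```python
import unicodedata

# The same pointed letter, with the marks stored in two different orders.
BET_DAGESH_QAMATS = "\u05D1\u05BC\u05B8"  # bet, dagesh, qamats
BET_QAMATS_DAGESH = "\u05D1\u05B8\u05BC"  # bet, qamats, dagesh


def canonically_equivalent(left: str, right: str) -> bool:
    """Hypothetical helper: True when both strings share the same NFD form."""
    return unicodedata.normalize("NFD", left) == unicodedata.normalize("NFD", right)


if __name__ == "__main__":
    # A plain code-point comparison treats the two spellings as different.
    print(BET_DAGESH_QAMATS == BET_QAMATS_DAGESH)  # False
    # Comparing normalized forms treats them as the same text.
    print(canonically_equivalent(BET_DAGESH_QAMATS, BET_QAMATS_DAGESH))  # True
```

Any comparison used for highlighting would need to go through a check like this (or compare pre-normalized strings) rather than raw string equality.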

DoD: Opening a Hebrew text that is canonically equivalent to, but encoded differently from, another Hebrew text produces the same highlighting. Text: UHB version 2.1.9.

Needs a definition of what characters can be re-ordered. Consult @jag3773

birchamp commented 4 years ago

Needs to be discussed with @klappy

PhotoNomad0 commented 4 years ago

@birchamp - rather than a string-punctuation-tokenizer issue, it looks like this should be a tC issue? Or is this tC-create?

klappy commented 4 years ago

@PhotoNomad0 If we address this in a common library, the applications will normalize consistently and we avoid duplicating effort on a highly complex task. String normalization appears simple at first but quickly becomes a maintenance headache.

PhotoNomad0 commented 4 years ago

@klappy agreed. Do we have a plan on how to do this that won't bring the app performance to a crawl on slow machines? I'm guessing we don't have detail on how this would be implemented for Hebrew. The link looks very general. Are we mostly talking about Hebrew words where the accent characters are in different order? Or are there other cases?

klappy commented 4 years ago

My understanding is that there is a set list of these entities that are identical, and there may be a preferred rendered form that we want as the output.

The tokenizer has a new feature for normalization in the latest release to allow for custom normalization. We can continue working on the implementation and refining it as we have more use cases.

klappy commented 4 years ago

Oh, and for performance, normalization should only be run when absolutely necessary. Since our use case is the original languages, it should be easy to run it once on the Hebrew text prior to parsing it. That is far more efficient than running it on every verse, word, or token that gets rendered.
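The "normalize once, then parse" idea can be sketched as follows. This is only an illustration in Python: `normalize_once` and `simple_tokenize` are hypothetical names, and the whitespace split is a stand-in for the real tokenizer, not its API.

```python
import unicodedata


def normalize_once(text: str) -> str:
    """Normalize the whole source text a single time, before any parsing."""
    return unicodedata.normalize("NFC", text)


def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for the tokenizer: split on whitespace only."""
    return text.split()


# Two pointed Hebrew words; the first letter stores dagesh before qamats.
SOURCE = (
    "\u05D1\u05BC\u05B8\u05E8\u05B8\u05D0 "
    "\u05D0\u05B1\u05DC\u05B9\u05D4\u05B4\u05D9\u05DD"
)

# One normalization pass over the full text, then tokenize the result.
tokens = simple_tokenize(normalize_once(SOURCE))

if __name__ == "__main__":
    for token in tokens:
        # Every token is already in NFC, so no per-token pass is needed.
        print([f"U+{ord(ch):04X}" for ch in token])
```

Because normalization forms are preserved when a normalized string is split at whitespace, the tokens come out already normalized and rendering code never has to call the normalizer again.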

PhotoNomad0 commented 4 years ago

@klappy - makes sense, but requirements are still vague. Can you get that set list of these entities and attach it here? And it does seem strange that we are doing word normalization in a string tokenizer. Also, it would seem better to have a rule set than a list of identical entities, since a list could cause a lot of churn as people keep finding one more identity that they noticed.

Or perhaps I am misunderstanding what a set list of these entities entails.

richmahn commented 4 years ago

Also, haven't read all comments here, but would doing any normalization on the DCS side through a git hook work at all so all apps are using the same data and get it massaged correctly? Go (the language DCS is written in) does support text normalization: https://blog.golang.org/normalization

ancientTexts-net commented 4 years ago

https://unicode.org/faq/normalization.html#10

Q: But isn't there still a problem with Biblical Hebrew?

A: There was a problem, but it has been addressed. Because the Hebrew points are defined to have distinct combining classes, their character semantics is such that their ordering is immaterial in the standard. To handle those cases where visual ordering is material, see the discussion of the Combining Grapheme Joiner (CGJ) in Section 23.2, Layout Controls, in the Unicode Standard.
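A small sketch of what the FAQ answer describes, using Python's standard `unicodedata` module. The sample letter and variable names are assumptions for illustration. It prints the canonical combining classes of a few Hebrew points, then shows that normalization reorders marks by class unless a Combining Grapheme Joiner (U+034F, class 0) sits between them.

```python
import unicodedata

# A few Hebrew points; each has its own canonical combining class.
POINTS = {"sheva": "\u05B0", "qamats": "\u05B8", "dagesh": "\u05BC"}
combining_classes = {name: unicodedata.combining(ch) for name, ch in POINTS.items()}

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Without CGJ, the marks are sorted by class: qamats (18) before dagesh (21).
reordered = unicodedata.normalize("NFC", "\u05D1\u05BC\u05B8")

# With CGJ between the marks, the stored order survives normalization.
protected = unicodedata.normalize("NFC", "\u05D1\u05BC" + CGJ + "\u05B8")

if __name__ == "__main__":
    print(combining_classes)
    print([f"U+{ord(ch):04X}" for ch in reordered])
    print([f"U+{ord(ch):04X}" for ch in protected])
```

Note that a string containing CGJ is not canonically equivalent to the same string without it, so any text that uses CGJ to preserve visual order would still compare as different after normalization.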

ancientTexts-net commented 4 years ago

article: a bit dated (but significant author)

only useful for "Appendix: Keyboard Charts" pp. 16-17

http://www.ntresources.com/blog/documents/Unicode4BibStudies.pdf

ancientTexts-net commented 4 years ago

SBL: info on normalization and helpful examples / test cases (e.g., Ps 27:13); also includes "recommended mark ordering"

https://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf

ancientTexts-net commented 4 years ago

StackOverflow article seems to be related to the [old] Safari problem mentioned on the phone:

https://stackoverflow.com/questions/11176603/how-to-avoid-browsers-unicode-normalization-when-submitting-a-form-with-unicode

ancientTexts-net commented 4 years ago

Screenshot of a normalization problem from a text of 1 Chr. 13:13. I have not verified which encoding UHT uses.

Unicode-Normalization - Example 1 Chr 13 13.png

ancientTexts-net commented 4 years ago

TODO: confirm with @klappy if Greek normalization is still required

jag3773 commented 4 years ago

My opinion is that yes, we need this for Greek too.
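For a sense of what the Greek case could look like, here is a minimal sketch with Python's standard `unicodedata` module. The specific characters are an illustrative assumption, not a case reported in this thread: three different encodings of an accented alpha are canonically equivalent and collapse to one NFC form.

```python
import unicodedata

ALPHA_OXIA = "\u1F71"              # GREEK SMALL LETTER ALPHA WITH OXIA
ALPHA_TONOS = "\u03AC"             # GREEK SMALL LETTER ALPHA WITH TONOS
ALPHA_PLUS_ACUTE = "\u03B1\u0301"  # alpha followed by COMBINING ACUTE ACCENT

# Collect the distinct NFC forms of the three spellings.
nfc_forms = {
    unicodedata.normalize("NFC", spelling)
    for spelling in (ALPHA_OXIA, ALPHA_TONOS, ALPHA_PLUS_ACUTE)
}

if __name__ == "__main__":
    # A single entry means all three spellings normalize to the same string.
    print([f"U+{ord(ch):04X}" for form in nfc_forms for ch in form])
```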