[6]SPIKE: Normalize Hebrew strings for canonical equivalence according to Unicode standard

unfoldingWord / string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

https://string-punctuation-tokenizer.netlify.app/#/Tokenize

MIT License


[6]SPIKE: Normalize Hebrew strings for canonical equivalence according to Unicode standard #46

Closed birchamp closed 4 years ago

birchamp commented 4 years ago

Based on https://unicode.org/reports/tr15/#Canonical_Equivalence and following. Note that: The outputs are not required to be identical, only canonically equivalent.

DoD: Opening a canonically equivalent, but different Hebrew text highlights the same as another Hebrew text. Text UHB version 2.1.9

Needs a definition of what characters can be re-ordered. Consult @jag3773
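For illustration, here is a minimal sketch of the equivalence check the DoD implies. It is written in Python with the standard `unicodedata` module and is not the tokenizer's code. The helper name `canonically_equal` and the sample strings are assumptions made up for this example. JavaScript offers the same normalization forms through `String.prototype.normalize`.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Return True when a and b are canonically equivalent.

    Two strings are canonically equivalent exactly when their NFD forms match.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hypothetical sample: bet with dagesh (U+05BC) and patah (U+05B7),
# typed in two different orders.
dagesh_first = "\u05D1\u05BC\u05B7"
patah_first = "\u05D1\u05B7\u05BC"

print(dagesh_first == patah_first)                   # False
print(canonically_equal(dagesh_first, patah_first))  # True
```

Comparing normalized forms rather than raw strings is what would let two differently encoded copies of the same Hebrew text highlight identically.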

birchamp commented 4 years ago

Needs to be discussed with @klappy

PhotoNomad0 commented 4 years ago

@birchamp - rather than a string-punctuation-tokenizer issue, it looks like this should be a tC issue? Or is this tC-create?

klappy commented 4 years ago

@PhotoNomad0 If we address this in a common library, the applications will normalize the same way and we avoid duplicating effort on a highly complex task. String normalization appears simple at first but quickly becomes a maintenance headache.

PhotoNomad0 commented 4 years ago

@klappy agreed. Do we have a plan on how to do this that won't bring the app performance to a crawl on slow machines? I'm guessing we don't have detail on how this would be implemented for Hebrew. The link looks very general. Are we mostly talking about Hebrew words where the accent characters are in different order? Or are there other cases?
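On the question of which marks can move: under the Unicode canonical ordering rules, a run of combining marks is sorted by canonical combining class (ccc), and marks that share a class keep their relative order. The sketch below is only a way to inspect those classes, using Python's standard `unicodedata` module. The helper name `combining_classes` and the sample word are assumptions for illustration.

```python
import unicodedata


def combining_classes(text: str) -> list[tuple[str, int]]:
    """List each code point with its canonical combining class (0 = base character)."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Hypothetical sample: bet, dagesh, patah (dagesh typed before the vowel).
sample = "\u05D1\u05BC\u05B7"

print(combining_classes(sample))
# [('U+05D1', 0), ('U+05BC', 21), ('U+05B7', 17)]

# After normalization the marks are sorted by class: patah (17) before dagesh (21).
print(combining_classes(unicodedata.normalize("NFD", sample)))
# [('U+05D1', 0), ('U+05B7', 17), ('U+05BC', 21)]
```

Because the Hebrew points mostly have distinct classes, their order after a base letter is fully determined by normalization. Marks with the same class are left in the order they were typed.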

klappy commented 4 years ago

My understanding is that there is a fixed list of sequences that are treated as identical, and there may be a preferred rendered form that we want to normalize to.

The tokenizer has a new feature for normalization in the latest release to allow for custom normalization. We can continue working on the implementation and refining it as we have more use cases.

klappy commented 4 years ago

Oh, and for performance, normalization should only be run when absolutely necessary. Since our use case is original languages, it should be easy to run it once on the Hebrew text prior to parsing it. That is far more efficient than running it on every verse, word, or token that gets rendered.
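A rough sketch of that "normalize once, then parse" approach, in Python for brevity. The names `prepare_text` and `tokenize` are hypothetical, and the whitespace split is only a stand-in for the real tokenizer. `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata


def prepare_text(text: str, form: str = "NFC") -> str:
    """Normalize a whole text once, skipping the work if it is already normalized."""
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)


def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): split on whitespace only."""
    return text.split()


# Hypothetical input: the same word encoded with two different mark orders.
raw = "\u05D1\u05BC\u05B7 \u05D1\u05B7\u05BC"

tokens = tokenize(prepare_text(raw))  # one normalization pass for the whole text
print(tokens[0] == tokens[1])         # True
```

With this shape the cost is paid once per loaded text, and every later comparison or render works on already normalized tokens.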

PhotoNomad0 commented 4 years ago

@klappy - makes sense, but the requirements are still vague. Can you get that set list of these entities and attach it here? It also seems strange that we are doing word normalization in a string tokenizer. And it would seem better to have a rule set than a list of identical entities, since a list could cause a lot of churn as people keep finding one more identity that they noticed.

Or perhaps I am misunderstanding what a set list of these entities entails.

richmahn commented 4 years ago

Also, haven't read all comments here, but would doing any normalization on the DCS side through a git hook work at all so all apps are using the same data and get it massaged correctly? Go (the language DCS is written in) does support text normalization: https://blog.golang.org/normalization