lbryio / lbrycrd

The blockchain that provides the digital content namespace for the LBRY protocol
https://lbry.com
MIT License

Discuss character normalization scheme #204

Closed: kaykurokawa closed this issue 5 years ago

kaykurokawa commented 6 years ago

This is an issue opened to discuss the character normalization scheme in https://github.com/lbryio/lbrycrd/pull/159. Character normalization is not an area of particular expertise for the blockchain team, so it should be opened up for discussion with everyone else. The details of the normalization should not affect the PR outside of the specific normalization function and unit tests.

The current scheme proposed in #159 is to implement NFD normalization (see the links below for more information):

http://unicode.org/reports/tr15/
http://unicode.org/faq/normalization.html

and then lower-case the letters using the en_US locale.
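
For concreteness, here is a minimal sketch of what "NFD normalize, then lower-case with en_US" could look like using boost::locale with the ICU backend. This is illustrative only, not the code in PR #159; the function name and locale string are assumptions.

```cpp
#include <boost/locale.hpp>
#include <string>

// Illustrative sketch only -- not the code from PR #159.
// Assumes boost::locale is built with the ICU backend.
std::string normalizeClaimName(const std::string& name)
{
    static const std::locale loc = boost::locale::generator()("en_US.UTF-8");
    // Decompose to NFD, then lower-case using the en_US locale.
    const std::string nfd = boost::locale::normalize(name, boost::locale::norm_nfd, loc);
    return boost::locale::to_lower(nfd, loc);
}
```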

I suspect that this scheme will have problems handling other languages. In particular, note that internationalized domain names use a custom scheme to deal with various locale needs (https://en.wikipedia.org/wiki/Internationalized_domain_name - I wonder if there is some module we can import for this?).

Note that any bad scheme we end up implementing here will be extremely troublesome to undo, since we would have to dive into the details of various languages and manually fix the normalization, so it is important that we get it right the first time.

kauffj commented 6 years ago

I haven't done a deep dive on this yet, but chiming in to say that we should avoid any US-centric solution.

BrannonKing commented 6 years ago

A few thoughts: lbrycrd is already US-centric in that its CLI only shows English help, and its RPC commands use English field names. It also returns JSON, which, according to that standard, requires US-style periods in floating-point numbers. JSON also has invariant date format requirements.

You can see a chart of the methods that use the locale here: http://userguide.icu-project.org/services

Unfortunately, it's hard to determine exactly which ICU methods we're using because of the Boost wrapper. Generally speaking, though, those formatting methods aren't the features we're looking for.

In Naut's code, we see that he first calls normalize and then to_lower. The former call should put the name into a locale-neutral form before the case is lowered. I don't think it's actually using en_US to do the case conversion. I think this is the best we can do without accepting a locale as part of claim requests.

kauffj commented 6 years ago

1. I agree that if we have to pick a locale, the US makes sense.
2. I understand the locale cannot be context dependent (somewhat obviously).
3. The fact that underlying, non-user-facing data is US-centric is somewhat irrelevant. Every HTTP request contains English strings, but this has little to no effect on its usability in other languages.
4. Is there really no convention here that isn't locale-specific? I would think this is something that applies to any non-locale-specific TLD.

BrannonKing commented 6 years ago

As I understand it (and mentioned above), the "normalize" method is the equivalent of "make it non-locale-specific".

kaykurokawa commented 6 years ago

From my understanding NFD normalization is not locale specific, but the function "lower" is.

lbrynaut commented 6 years ago

> From my understanding NFD normalization is not locale specific, but the function "lower" is.

Unicode is Unicode, so picking a normalization form (like NFD) is language-agnostic. It's correct that locale only comes into play in the lower-casing step. Boost provides the locale-aware version as well (via ICU), but I opted not to use it because, while it is locale-specific, it will not work where that locale is unavailable. What this means is that the chosen lower-case algorithm may not actually lower-case words represented in Unicode that cannot be lower-cased without locale awareness, and I think that's a choice we live with for consistency.
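
As a small illustration of the locale dependence being traded away here, the classic example is the Turkish dotted/dotless i: the same input lower-cases differently under en_US and tr_TR. A sketch, assuming a boost::locale build with the ICU backend and both locales available on the system (neither of which is guaranteed):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

// Locale-dependent case mapping: "I" lower-cases to "i" under en_US,
// but to dotless "ı" (U+0131) under tr_TR. Requires the ICU backend
// and both locales to be available.
int main()
{
    boost::locale::generator gen;
    const std::locale en = gen("en_US.UTF-8");
    const std::locale tr = gen("tr_TR.UTF-8");

    const std::string input = "I";
    std::cout << boost::locale::to_lower(input, en) << "\n"; // prints "i"
    std::cout << boost::locale::to_lower(input, tr) << "\n"; // prints "ı"
    return 0;
}
```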

lbrynaut commented 6 years ago

> The issue of character normalization is not a topic of expertise that the blockchain team has in particular so this should be opened up for discussion with everyone else.

Agreed, but it's a topic that I have more experience in than many. Doesn't mean I'm an expert.

lbrynaut commented 6 years ago

> Note that any bad scheme we end up implementing here will be extremely troublesome for us to undo since we have to dive into the details of various languages and manually fix the normalization, so it is important that we get it right the first time.

I don't agree, since the Unicode normalization will not be an issue. Knowing that there may be international casing issues is something I think we can live with. In the end, we are still making a best effort to normalize the code points consistently, which is far better than what we have today (which is just about guaranteed to break or be inconsistent across different languages/locales).

BrannonKing commented 6 years ago

This was an interesting read: http://unicode.org/faq/casemap_charprop.html

BrannonKing commented 6 years ago

One consequence of our current trie structure is that partial Unicode names will no longer make sense; there is actual data loss in the visualization. Hence, our RPC method getclaimtrie will be even more useless than it already is. E.g.:

D: , 0, 
D: a, 0, 
D: am, 0, 
D: ame, 0, 
D: ame�, 0, 
D: amé, 0, 
D: amél, 0, 
D: améli, 0, 
D: amélie, 151, C: 151, 151, ccee2544f588f1460b651a4b515d5aaffaf44a0c, 
D: an, 0, 
D: an�, 0, 
D: añ, 0, 
D: añe, 0, 
D: añej, 0, 
D: añejo, 153, C: 153, 153, d77eb2e78153b931ebdd5b25d35c890f733cd821, 
D: f, 0, 
D: fo, 0, 
D: foo, 148, C: 149, 149, 813d5644bb3446ef125d9dc693930faee8bb094d, 
D: n, 0, 
D: no, 0, 
D: nor, 0, 
D: norm, 0, 
D: norma, 0, 
D: normal, 0, 
D: normali, 0, 
D: normaliz, 0, 
D: normalize, 0, 
D: normalizet, 0, 
D: normalizete, 0, 
D: normalizetes, 0, 
D: normalizetest, 144, C: 150, 150, f10565974200feab9fd6482e3ec38340e1f217dd, C: 148, 146, 83e64772ee7068459fdd83336fd6600dc54b5bd4, C: 142, 142, 744daead267cf57bd2aab929b4f8d8f34e352993, 
D: �, 0, 
D: ��, 0, 
D: あ, 0, 
D: あ�, 0, 
D: あ��, 0, 
D: あて, 0, 
D: あて�, 0, 
D: あて��, 0, 
D: あては, 0, 
D: あては�, 0, 
D: あては��, 0, 
D: あてはま, 0, 
D: あてはま�, 0, 
D: あてはま��, 0, 
D: あてはまる, 152, C: 152, 152, 4e2223dbe0255db22539b75b30421d6e17bb83e4, 
kauffj commented 5 years ago

Late comment on this, but I find rejecting claims for names with unreadable/unrenderable characters to be perfectly acceptable.

There may be other reasons this is a bad idea, in which case I defer to the blockchain team's judgement. But from a UX perspective, I think it's completely unobjectionable to reject unrenderable claim names.

kaykurokawa commented 5 years ago

@lbrynaut says that we should look into whether lowering the case can throw an error, i.e. there may be cases where it does not know what to do.

BrannonKing commented 5 years ago

I think this can be closed; reopen it if further discussion is needed. As it stands, we'll go with the current decisions:

  1. We will use NFD normalization with the en_US locale (a rough sketch of items 1 and 2 follows this list).

  2. If the incoming UTF8 data is invalid, we will use the data but it will not be normalized or lower-cased, and will compete only against those exact bytes. We assume that current apps and clients will display partial characters (aka, � symbols) and that they will disallow entry of them.

  3. We will eliminate the archaic getclaimtrie RPC method.
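
A rough sketch of how decisions 1 and 2 could fit together, assuming boost::locale with the ICU backend; the function name and the use of utf_to_utf for the validity check are illustrative assumptions, not the shipped implementation:

```cpp
#include <boost/locale.hpp>
#include <string>

// Hedged sketch of decisions 1 and 2 above, not the actual implementation.
// Invalid UTF-8 is kept as-is (it competes only against those exact bytes);
// valid UTF-8 is NFD-normalized and lower-cased with the en_US locale.
std::string normalizeIfValidUtf8(const std::string& name)
{
    static const std::locale loc = boost::locale::generator()("en_US.UTF-8");
    try {
        // Throws boost::locale::conv::conversion_error on malformed UTF-8.
        (void)boost::locale::conv::utf_to_utf<char32_t>(name, boost::locale::conv::stop);
    } catch (const boost::locale::conv::conversion_error&) {
        return name; // leave the raw bytes untouched
    }
    const std::string nfd = boost::locale::normalize(name, boost::locale::norm_nfd, loc);
    return boost::locale::to_lower(nfd, loc);
}
```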