Unicode string normalization

jfroelich commented 5 years ago

Input data should be normalized where appropriate.

See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html

jfroelich commented 5 years ago

Even though I may want to introduce normalization at some earlier stage of processing, I think the place to start for now is the model layer. The model layer is supposed to be responsible for sanitizing its own input, regardless of what the other layers that interact with it do.

Therefore I think the best way to start is:

Create a new helper module in the model layer, named something like normalize-string, that encapsulates what normalization means here and documents it
Update sanitize-entry and sanitize-feed to apply that function to each string field
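
A rough sketch of what those two steps could look like. The names normalizeString and normalizeStringFields are placeholders I made up for illustration, and the entry shape is invented; the real sanitize-entry and sanitize-feed do more than this.

```javascript
// Hypothetical helper: the one place that defines what "normalized" means.
// Non-string values pass through untouched.
function normalizeString(value) {
  // normalize() with no argument uses the default form, NFC
  return typeof value === 'string' ? value.normalize() : value;
}

// Hypothetical stand-in for the step that sanitize-entry and sanitize-feed
// would gain: apply the helper to every own string field, returning a copy.
function normalizeStringFields(object) {
  const result = {};
  for (const [key, value] of Object.entries(object)) {
    result[key] = normalizeString(value);
  }
  return result;
}

// Made-up entry whose title uses "e" followed by a combining acute accent
const entry = { title: 'Cafe\u0301', author: 'Ame\u0301lie', datePublished: 0 };
const clean = normalizeStringFields(entry);

console.log(clean.title === 'Caf\u00e9'); // true, now the precomposed form
console.log(clean.title.length); // 4, was 5 before normalization
```

This only touches top-level string fields. Nested objects or arrays of strings would need their own handling.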

jfroelich commented 5 years ago

https://unicode.org/reports/tr15/

jfroelich commented 5 years ago

From the spec:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate.

So I probably want to avoid the compatibility forms (NFKC and NFKD), even though the result can be more compact, because of the risk of losing meaning. The default form, NFC, is probably the one I want, so I can just call String.prototype.normalize with no argument and let it default to NFC.
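
A quick sketch of the difference, using sample characters I picked myself (the "fi" ligature and a superscript two). NFC composes combining sequences but leaves these characters alone, while NFKC rewrites them.

```javascript
const ligature = '\ufb01'; // LATIN SMALL LIGATURE FI
const squared = 'x\u00b2'; // x followed by SUPERSCRIPT TWO
const decomposed = 'e\u0301'; // e followed by COMBINING ACUTE ACCENT

// Calling normalize() with no argument is the same as asking for NFC
const nfcDefault = decomposed.normalize();
const nfcExplicit = decomposed.normalize('NFC');
console.log(nfcDefault === nfcExplicit); // true
console.log(nfcDefault === '\u00e9'); // true, composed into one code point

// NFC preserves the formatting distinctions
console.log(ligature.normalize('NFC') === ligature); // true
console.log(squared.normalize('NFC') === squared); // true

// NFKC erases them: the ligature becomes "fi", the superscript a plain "2"
console.log(ligature.normalize('NFKC')); // fi
console.log(squared.normalize('NFKC')); // x2
```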

jfroelich commented 5 years ago

Side question: does normalization occur when using Response.prototype.text or innerHTML? If so then normalization is already performed implicitly elsewhere and all of this is a waste of time other than the learning aspect.

jfroelich commented 5 years ago

Take special note of section 1.4 regarding concatenation. The basic takeaway is that if I plan to break a string apart, change its parts, and then recompose it, normalization should wait until all of those changes and concatenations are complete. Concatenating two normalized strings does not necessarily produce a normalized string, so normalizing too early would defeat the entire point of this exercise.
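
A minimal sketch of the concatenation problem. The isNfc helper and the sample strings are mine, not from the spec: each half is already in NFC, but the joined string is not.

```javascript
// Hypothetical check: a string is in NFC if normalizing it changes nothing
function isNfc(value) {
  return value === value.normalize('NFC');
}

const left = 'e'; // plain letter, already NFC
const right = '\u0301'; // a lone combining acute accent, also already NFC

const joined = left + right;

console.log(isNfc(left)); // true
console.log(isNfc(right)); // true
console.log(isNfc(joined)); // false, the pair should compose into U+00E9

// Normalizing once, after all concatenation is done, restores the guarantee
const finalValue = joined.normalize();
console.log(isNfc(finalValue)); // true
console.log(finalValue === '\u00e9'); // true
```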

jfroelich commented 5 years ago

That last note shifts the balance regarding when normalization should be performed. It suggests that normalization is best done at the model layer, just before writing to persistent storage, because at that point I know no more changes will be made. I can also guarantee it by encapsulating the step behind a more opaque API surface, so that the value cannot be modified once it is inside the model function body.

jfroelich commented 5 years ago

So, in summary: change sanitize-feed and sanitize-entry to apply string normalization, and document that the caller should not make any more changes to values after those functions have been called. Or place the functionality within the update-feed and update-entry functions to enforce it and remove caller discretion entirely.

jfroelich commented 5 years ago

I need to enforce normalization inside the model, in every function that performs an insert or update. What is the best way to implement this? Do I even want this feature to be fully encapsulated within the model and abstracted away (information hiding)? Do I want a function like normalize-entry or Entry.prototype.normalize? Or a private helper that each state-modifying function calls internally?
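
A sketch of the last option, the private helper, just to see how it reads. Everything here is hypothetical: the function names are placeholders and a Map stands in for the real persistent storage, so this only shows the shape of the idea.

```javascript
// Stand-in for persistent storage (assumption: the real model is not a Map)
const store = new Map();

// Private helper: not exposed to callers, so they cannot skip or undo it
function normalizeEntryStrings(entry) {
  const result = {};
  for (const [key, value] of Object.entries(entry)) {
    result[key] = typeof value === 'string' ? value.normalize() : value;
  }
  return result;
}

// Hypothetical insert: normalization is the last step before the write
function createEntry(entry) {
  const id = store.size + 1;
  store.set(id, { ...normalizeEntryStrings(entry), id });
  return id;
}

// Hypothetical update: only the incoming changes need normalizing, because
// whatever is already stored went through the same helper
function updateEntry(id, changes) {
  const existing = store.get(id);
  if (!existing) {
    throw new Error('No entry with id ' + id);
  }
  store.set(id, { ...existing, ...normalizeEntryStrings(changes), id });
}

const entryId = createEntry({ title: 'Cafe\u0301' });
updateEntry(entryId, { author: 'Ame\u0301lie' });

console.log(store.get(entryId).title === 'Caf\u00e9'); // true
console.log(store.get(entryId).author === 'Am\u00e9lie'); // true
```

The tradeoff is that callers get no say in the matter, which is the point if the model is supposed to own its invariants.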

jfroelich commented 5 years ago

Work remaining:

implement normalize-feed and normalize-feed-test
remove normalize-string-properties and its test

jfroelich / rss-reader

Unicode string normalization #770