twitter / twitter-cldr-rb

Ruby implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more.
Apache License 2.0

Any interest in UTR#30 Normalization? #100

Closed by jrochkind 11 years ago

jrochkind commented 11 years ago

While UTR#30 seems to have been (maybe?) formally abandoned as part of Unicode, it is still used by some software, such as current versions of Solr's ICUFoldingFilterFactory.

I am actually using Solr, and it would be very useful to me (in a complicated use case, but really) to be able to duplicate the UTR#30 transformation Solr performs, but in pure Ruby.

The translation files that Solr uses appear to be here.

I don't entirely understand how all these Unicode transformations work, but I believe that sort of transformation file is exactly what twitter-cldr-rb already uses for various other Unicode transformations?

So perhaps the logic is already there in twitter-cldr-rb to perform the UTR#30 transformation, just fed with a different data file?

Is there any interest in having twitter-cldr-rb actually support UTR#30? Alternately, is there an easy way I can hack twitter-cldr-rb to use its existing logic for handling this sort of Unicode translation file (sorry, I don't know what these are actually called), but feed it the UTR#30 mappings cribbed from the Solr source?

Thanks for any ideas.

camertron commented 11 years ago

Hey @jrochkind,

This looks pretty interesting, but unfortunately we don't have plans to incorporate UTR#30 into TwitterCLDR at the moment. At first glance, it seems like a fairly straightforward algorithm, and I would happily accept a pull request. TwitterCLDR's current transformations are really normalizations, one of which UTR#30 specifically depends on (NFD), so at least that's already done. You can make use of NFD normalization using the corresponding class:

TwitterCldr::Normalization::NFD.normalize(text)

# alternatively:
text.localize.normalize(:using => :NFD)

Good luck!

jrochkind commented 11 years ago

Thanks! I may try a pull request in the future.

jrochkind commented 11 years ago

Do you have any advice as to how to use the mapping data files of the sort here with TwitterCLDR? That is, is there already a part of TwitterCLDR written to use this kind of mapping data, but applied to other mapping data? I ask because it seems like this may be some kind of standard unicode mapping data file, I'm not sure.

camertron commented 11 years ago

It looks like those files contain a series of folding rules that map one character (or range of characters) to another. The algorithm in UTR#30 says to perform the following steps:

a. Apply optional folding operations (i.e. rules from the Solr files)
b. Apply canonical decomposition (described above)
c. Repeat (a) and (b) until stable (I think "stable" means "until you can't decompose any more")
d. Apply composition if necessary (only if you want the string in composed form, based on your technical requirements)
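
To make those steps concrete, here's a minimal sketch of that loop in Ruby. Only the NFD call is the actual TwitterCLDR API shown earlier in this thread; apply_foldings is a hypothetical helper standing in for the Solr folding rules, and the NFC class in step (d) is assumed to exist alongside the NFD one:

require 'twitter_cldr'

# Hypothetical sketch of the UTR#30 loop; apply_foldings is a stand-in
# for applying the folding rules from the Solr data files (step a).
def utr30_fold(text, compose = false)
  loop do
    folded = apply_foldings(text)                                   # step (a)
    decomposed = TwitterCldr::Normalization::NFD.normalize(folded)  # step (b)
    break if decomposed == text                                     # step (c): stable
    text = decomposed
  end
  # step (d): recompose only if required (assumes an NFC class like the NFD one above)
  compose ? TwitterCldr::Normalization::NFC.normalize(text) : text
end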

Applying a folding operation might look something like this: given a rule like 058A>002D, every time you encounter a "058A" character, you'd replace it with "002D". Bear in mind that I only took a cursory glance over the UTR#30 spec, so that might be incorrect. Indeed, the spec is quite a bit more complicated than that.
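
For illustration, here's a rough sketch (not TwitterCLDR API) of parsing and applying a single rule in that format. Both helper names are made up for this example, and real Solr files also contain code-point ranges and multi-character replacements, which this ignores:

# Hypothetical helpers for one "XXXX>YYYY" folding rule: hex code points
# on either side of '>' (a bare "XXXX>" would mean "delete the character").
def parse_rule(line)
  from, to = line.split('>')
  [from.hex, to && to.hex]
end

def apply_rule(text, from, to)
  text.gsub([from].pack('U'), to ? [to].pack('U') : '')
end

from, to = parse_rule('058A>002D')
apply_rule("a\u058Ab", from, to)  # => "a-b"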

If you do decide to work on this feature and submit a PR, I'd suggest looking around for a test file. Unicode publishes a set of test data (inputs and correct outputs) for algorithms like normalization and bidi, so it's possible they have one for folding as well.

Good luck!