ukparliament / parliament.uk-prototype

Parliament.uk prototype is an early incarnation of a new parliament website. Please see the README for more information.
https://beta.parliament.uk/
MIT License

Unicode normalization #227

Open langsamu opened 7 years ago

langsamu commented 7 years ago

tl;dr ö -> o in Ruby.

We discussed getting a list of members by their initials, and mentioned that some initials are non-English characters, like Ö. We said one correct solution is to show those characters in the list of initials. I mentioned using Unicode knowledge of combining diacritical marks to strip the accents.

I was talking about Unicode normalization forms, which "make it possible to determine whether any two Unicode strings are equivalent to each other" by "put[ting] all combining marks in a specified order".

So, in a decomposed form such as NFD, ö (U+00F6: Latin Small Letter O With Diaeresis) becomes o◌̈ (U+006F: Latin Small Letter O + U+0308: Combining Diaeresis). You can then take the first character of the resulting sequence (it could be longer than a pair if there are several combining marks), and use that.

Unicode-aware languages/frameworks understand this in their string classes. Here's an example in .NET/C#, using the Normalize method on the String class.

Ruby does the same in UnicodeNormalize (since 2.2); the equivalent method is unicode_normalize.
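As a rough illustration of the idea in Ruby, here is a minimal sketch using only the standard library. The helper name `base_initial` and the sample names are made up for this example; only `String#unicode_normalize` comes from Ruby itself.

```ruby
# Show what NFD does to a precomposed "ö" (written as an escape to be explicit).
decomposed = "\u00F6".unicode_normalize(:nfd)
puts decomposed.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
# prints "U+006F U+0308"

# Hypothetical helper: reduce a name to the base letter of its first character.
def base_initial(name)
  # NFD splits a precomposed character into its base letter followed by
  # any combining marks, so the base letter ends up first.
  first = name.unicode_normalize(:nfd)[0]
  # An empty name has no first character; return nil in that case.
  first&.upcase
end

puts base_initial("Öztürk")  # prints "O"
puts base_initial("Smith")   # prints "S"
```

This only helps for letters that actually have a decomposition. A letter such as Ø (U+00D8) has none, so it would come through unchanged and still need its own entry or some other mapping.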

There's also transliterate in ActiveSupport. I wouldn't know which one's better.

langsamu commented 7 years ago

I just realized this doesn't really solve the problem, since we'd need SPARQL to do the same, not Ruby.

mattrayner commented 7 years ago

Part of a 'fix' for this is loading all of our alphabet letters from the SPARQL endpoint. However, having looked through some queries, our logic on the front end is as follows:

- Get a list of 'available' letters from the SPARQL endpoint (letters that contain people results)
- For every letter from A-Z:
  - Output an element to the screen:
    - If the letter (a-z) appears in our list of available letters:
      - Add a class to make it clickable

The fundamental flaw with this is that we are only checking A-Z, which means results whose initial is a letter outside A-Z (such as Ö) do not get included in the A-Z lists at the top of some pages.
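To make the flaw concrete, here is a small Ruby sketch of the logic described above. It is not the project's actual code: `AVAILABLE_LETTERS` is a made-up stand-in for what the SPARQL endpoint returns, and `letter_nav` and `missed_letters` are hypothetical helper names.

```ruby
# Assumed sample data: letters that have people results, including "ö".
AVAILABLE_LETTERS = ["a", "c", "m", "\u00F6"].freeze

# Mirrors the described logic: walk a-z only and flag letters with results.
def letter_nav(available)
  ("a".."z").map do |letter|
    { letter: letter, clickable: available.include?(letter) }
  end
end

# Letters that have results but are never reached by the a-z loop.
def missed_letters(available)
  available - ("a".."z").to_a
end

nav = letter_nav(AVAILABLE_LETTERS)
clickable = nav.select { |item| item[:clickable] }.map { |item| item[:letter] }
p clickable  # prints ["a", "c", "m"]

# "ö" is available but never rendered, so it cannot be clicked.
missed = missed_letters(AVAILABLE_LETTERS)
puts missed.length  # prints 1

# One option from earlier in the thread: fold each missed letter onto its
# base letter via NFD so those people could be listed under "o".
folded = missed.map { |letter| letter.unicode_normalize(:nfd)[0] }
p folded  # prints ["o"]
```

The other option discussed, showing those characters in the list of initials, would amount to rendering the `missed` entries as extra navigation items alongside A-Z.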