Combining characters UTF-8

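The problem is that some special characters, like ç (c with cedilla), can be expressed in multiple ways in Unicode: either as a single pre-composed character or as a base character followed by combining diacritics (cf. http://unicode.org/faq/char_combmark.html). Both forms render identically, but they are different code point sequences, so a plain string comparison treats them as different text.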

For example, the "Latin small letter c with cedilla" can be expressed using the pre-composed U+00E7 (encoded in "Controls and Latin-1 Supplement"), as in François, OR using a combination of the character "c" from the "Lowercase Latin alphabet" (ASCII) and the "Combining cedilla" from "Combining Diacritical Marks": U+0063 U+0327, as in François.
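To make the difference concrete, here is a minimal Python sketch using only the standard library's `unicodedata` module. The variable names are just for illustration; the point is that the two spellings compare unequal until both are brought to the same normalization form (NFC or NFD).

```python
import unicodedata

# Pre-composed form: ç is the single code point U+00E7.
precomposed = "Fran\u00e7ois"
# Decomposed form: "c" (U+0063) followed by combining cedilla (U+0327).
decomposed = "Franc\u0327ois"

# The strings look the same on screen but differ in code points and bytes.
same_without_normalizing = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))
byte_lengths = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))

# NFC composes where possible, NFD decomposes; either one makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_without_normalizing)  # False
print(lengths)                   # (8, 9)
print(byte_lengths)              # (9, 10)
print(nfc_equal, nfd_equal)      # True True
```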

This problem needs to be solved in two ways:

a) Characters need to be normalized, either on input (ziziphus) or for search (tamboti), so that either François or François is used consistently.

b) We should try to include all variants of search terms (names, concepts, locations, etc.) in the search, so that users can search for "法蘭索瓦·布雪", or "Boucher, Frances", or "Bushe, Fransua", or "François Boucher" and will always find the person http://viaf.org/viaf/66517093 ("Boucher, François, 1703-1770").

For ULAN, a fuzzy matching tool called "ulanr" was developed by Matthew Lincoln, cf. http://matthewlincoln.net/2016/03/11/ulanr-3-0.html

hcts-hra / ziziphus

Combining characters UTF-8 #400