hcts-hra / ziziphus


Combining characters UTF-8 #400

Open SimonGruening opened 8 years ago

SimonGruening commented 8 years ago

In some cases the display values coming from VIAF (and maybe other controlled vocabularies, not tested yet) seem to use combining characters instead of single pre-composed characters in UTF-8. example

The problem in these cases is that the Tamboti search doesn't find those terms, so the related records are not displayed in the search results.

Suggestion: It would be useful to make all term variations [called "forms" in VIAF] available for searching to minimize the probability that a term cannot be found.

MatthiasArnold commented 8 years ago

The problem is that some special characters, like ç (c cedilla), can be expressed in multiple ways in Unicode: either as a single pre-composed character or as a base character followed by combining diacritics (cf. http://unicode.org/faq/char_combmark.html).

For example, the "Latin small letter c with cedilla" can be expressed using the pre-composed U+00E7 (encoded in "C1 Controls and Latin-1 Supplement"), as in François, OR using a combination of the character "c" from "Basic Latin" (ASCII) and the "Combining cedilla" from "Combining Diacritical Marks": U+0063 U+0327, as in François. Both render identically but are different code point sequences.
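To make the difference visible, here is a minimal Python sketch (Python is only an illustration language here, not what ziziphus or Tamboti use). It builds both spellings from the code points named above and shows that they compare as unequal until they are normalized. The helper name `codepoints` is made up for this example.

```python
import unicodedata

# Pre-composed form: ç is the single code point U+00E7.
precomposed = "Fran\u00e7ois"
# Combining form: "c" (U+0063) followed by combining cedilla (U+0327).
decomposed = "Franc\u0327ois"


def codepoints(text):
    """Return the code points of a string as 'U+XXXX' tokens (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(codepoints(precomposed))
print(codepoints(decomposed))

# The strings look the same on screen but are not equal as code point sequences.
print(precomposed == decomposed)  # False
print(len(precomposed), len(decomposed))  # 8 9

# After normalizing to NFC (composed form), both are identical.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A plain string comparison, which is what a naive search amounts to, treats the two forms as different terms.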

This problem requires solutions in two ways:

a) Characters need to be normalized, either on input (ziziphus) or for search (tamboti), so that either the pre-composed François (U+00E7) or the combining François (U+0063 U+0327) is used consistently.

b) We should try to include all variants of search terms (names, concepts, locations, etc.) in the search, so that users can search for "法蘭索瓦·布雪", or "Boucher, Frances", or "Bushe, Fransua", or "François Boucher" and will always find the person http://viaf.org/viaf/66517093 ("Boucher, François, 1703-1770").

For ULAN, a fuzzy matching tool called "ulanr" was developed by Matthew Lincoln, cf. http://matthewlincoln.net/2016/03/11/ulanr-3-0.html
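Both points could be combined roughly as follows. This is only a sketch in Python under stated assumptions, not how ziziphus or Tamboti are implemented: the names `search_key`, `build_index` and `search` are hypothetical, the variant list is copied by hand from the examples above rather than fetched from VIAF, and matching is exact after normalization (no fuzzy matching as in ulanr).

```python
import unicodedata


def search_key(text):
    """Normalize to NFC and case-fold, so both spellings of ç give the same key."""
    return unicodedata.normalize("NFC", text).casefold()


def build_index(records):
    """Map every normalized name form to the set of record URIs that use it."""
    index = {}
    for uri, forms in records.items():
        for form in forms:
            index.setdefault(search_key(form), set()).add(uri)
    return index


def search(index, query):
    """Look up a query after applying the same normalization as the index."""
    return index.get(search_key(query), set())


# Hand-copied forms for one person; a real system would take these from VIAF.
records = {
    "http://viaf.org/viaf/66517093": [
        "Boucher, Fran\u00e7ois, 1703-1770",  # pre-composed ç
        "Boucher, Frances",
        "Bushe, Fransua",
        "法蘭索瓦·布雪",
        "Franc\u0327ois Boucher",  # c + combining cedilla
    ],
}

index = build_index(records)

# Either spelling of the query finds the same record.
print(search(index, "Fran\u00e7ois Boucher"))
print(search(index, "Franc\u0327ois Boucher"))
print(search(index, "法蘭索瓦·布雪"))
```

The important design choice is that the same normalization runs on indexed terms and on queries; whether NFC or NFD is picked matters less than applying it on both sides.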