Open SimonGruening opened 8 years ago
The problem is that some special characters, such as ç (c with cedilla), can be expressed in multiple ways in Unicode: either as a single precomposed character or as a base character followed by a combining diacritic (cf. http://unicode.org/faq/char_combmark.html).
For example, the "Latin small letter c with cedilla" can be expressed using the precomposed U+00E7 (encoded in "C1 Controls and Latin-1 Supplement"), as in François, OR using a combination of the character "c" from "Basic Latin" (ASCII) and the "Combining Cedilla" from "Combining Diacritical Marks": U+0063 U+0327, as in François.
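To make the difference visible, here is a minimal Python sketch using only the standard `unicodedata` module. The two strings are written with explicit escapes so that the example does not depend on how this page itself is encoded; the variable names are only illustrative.

```python
import unicodedata

# Precomposed form: U+00E7 LATIN SMALL LETTER C WITH CEDILLA
precomposed = "Fran\u00e7ois"
# Decomposed form: U+0063 "c" followed by U+0327 COMBINING CEDILLA
combining = "Franc\u0327ois"

# The two strings render the same but are not equal code point by code point.
print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 8 9

# NFC composes to the precomposed form, NFD decomposes to the combining form.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```

A plain string comparison therefore treats the two spellings as different terms unless both sides are normalized first.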
This problem needs to be addressed in two ways:
a) Characters need to be normalized, either on input (Ziziphus) or for search (Tamboti),
so that either François
or François
is used consistently
b) We should try to include all variants of search terms (names, concepts, locations, etc.) in the search
so that users can search for "法蘭索瓦·布雪", or "Boucher, Frances", or "Bushe, Fransua", or "François Boucher" and will always find the person http://viaf.org/viaf/66517093 ("Boucher, François, 1703-1770")
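Points a) and b) could be combined roughly as follows. This is only a sketch in Python, not how Tamboti or Ziziphus are implemented: `search_key`, `build_index`, `find`, and the `records` structure are hypothetical names, and the list of forms is simply the set of names mentioned above, not data fetched from VIAF.

```python
import unicodedata

def search_key(term: str) -> str:
    """Normalize a term to NFC and fold case so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", term).casefold().strip()

def build_index(records: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every normalized name form to the identifiers that carry it."""
    index: dict[str, set[str]] = {}
    for identifier, forms in records.items():
        for form in forms:
            index.setdefault(search_key(form), set()).add(identifier)
    return index

def find(term: str, index: dict[str, set[str]]) -> set[str]:
    """Look up a user query after applying the same normalization as the index."""
    return index.get(search_key(term), set())

# Hypothetical record: one identifier with several name forms (assumed sample data).
records = {
    "http://viaf.org/viaf/66517093": [
        "Boucher, Fran\u00e7ois, 1703-1770",
        "法蘭索瓦·布雪",
        "Boucher, Frances",
        "Bushe, Fransua",
        "Fran\u00e7ois Boucher",
    ]
}
index = build_index(records)

# A query typed with a combining cedilla still finds the record.
print(find("Franc\u0327ois Boucher", index))
print(find("法蘭索瓦·布雪", index))
```

Because the same `search_key` function is applied to stored forms and to queries, it does not matter which of the two spellings arrives from the vocabulary or from the user.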
For ULAN, a fuzzy matching tool called "ulanr" was developed by Matthew Lincoln, cf. http://matthewlincoln.net/2016/03/11/ulanr-3-0.html
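To illustrate what fuzzy matching could add on top of normalization, here is a small Python sketch based on the standard library's `difflib`. It is not the algorithm used by ulanr; `fuzzy_find`, the `forms` mapping, and the cutoff of 0.8 are assumptions chosen only for the example.

```python
import difflib
import unicodedata

def search_key(term: str) -> str:
    """Normalize to NFC and fold case before comparing."""
    return unicodedata.normalize("NFC", term).casefold().strip()

def fuzzy_find(query: str, forms: dict[str, str], cutoff: float = 0.8) -> set[str]:
    """Return identifiers whose name forms are similar enough to the query."""
    keys = {search_key(form): identifier for form, identifier in forms.items()}
    close = difflib.get_close_matches(search_key(query), list(keys), n=3, cutoff=cutoff)
    return {keys[key] for key in close}

# Hypothetical mapping from name form to identifier (assumed sample data).
viaf_id = "http://viaf.org/viaf/66517093"
forms = {
    "Fran\u00e7ois Boucher": viaf_id,
    "Boucher, Frances": viaf_id,
    "Bushe, Fransua": viaf_id,
}

# A query without the cedilla is still close enough to the stored form.
print(fuzzy_find("Francois Boucher", forms))
```

Fuzzy matching helps with queries that drop the diacritic altogether, which normalization alone does not cover, at the cost of possible false positives when the cutoff is set too low.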
It seems that in some cases the display values coming from VIAF [maybe also from other controlled vocabularies, not tested yet] use combining characters instead of single precomposed characters. example
The problem in these cases is that the Tamboti search does not find those terms, and therefore the related records are not displayed in the search results.
Suggestion: It would be useful to make all term variations [called "forms" in VIAF] available for searching to minimize the probability that a term cannot be found.