BTW, I found that create_romanized_query_txt and make_or_regexp can cause problems when handling Chinese and Japanese names (the google/master branch has the same problem). For example:
A Chinese person: 姚明, romanized: family name 姚 -> Yao, given name 明 -> Ming.
People from mainland China and Hong Kong usually put the family name last when they romanise their names. On a passport, the English version of the name would usually be 'Ming Yao'.
Note that people from Taiwan usually use a different order for the English version of the name: family_name given_name.
If his record is stored as 'Ming Yao' and we call full_text_search using his Chinese name:
full_text_search.search('haiti', u'姚明', 5)
then the and_query becomes u'("姚明" OR "Yao Ming") AND (repo: haiti)'. Since "Yao Ming" is an atomic (quoted phrase) term, the search API returns nothing, because the index only has "Ming Yao". make_or_regexp has the same problem: it assumes that the document field and the query are always in the same order.
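One possible direction, as a rough sketch only: OR together every ordering of the romanized parts so the phrase match no longer depends on name order. The helper name build_order_insensitive_query and its parameters are hypothetical and are not part of full_text_search; the query syntax simply mirrors the and_query shown above.

```python
from itertools import permutations


def build_order_insensitive_query(repo, original, romanized_parts):
    """Hypothetical helper: build a query matching any order of the name parts."""
    # Always keep the original (non-romanized) text as the first phrase.
    phrases = ['"%s"' % original]
    seen = set()
    for ordering in permutations(romanized_parts):
        phrase = ' '.join(ordering)
        # Skip duplicates, e.g. when family and given name romanize identically.
        if phrase not in seen:
            seen.add(phrase)
            phrases.append('"%s"' % phrase)
    return '(%s) AND (repo: %s)' % (' OR '.join(phrases), repo)


query = build_order_insensitive_query('haiti', u'姚明', ['Yao', 'Ming'])
print(query)
```

With this, a document indexed as "Ming Yao" would be matched by the third phrase. The number of phrases grows factorially with the number of parts, so this only seems reasonable for short names.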
Additionally, Japanese names are affected too: if the document only has a romaji full_name and the query is kanji without a space, the result will also be empty, because the romanisation of the query does not try to split the unspaced kanji name into a family name and a given name.
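To illustrate the missing split step, here is a minimal sketch that tries every split point of an unspaced name and keeps the splits where both halves can be romanized. TOY_READINGS, candidate_splits and romanized_candidates are all made up for this example; a real fix would need an actual name dictionary rather than this two-entry table.

```python
# Toy reading table, purely illustrative (an assumption, not real project data).
TOY_READINGS = {u'山田': 'Yamada', u'太郎': 'Taro'}


def candidate_splits(name):
    """Return every (family, given) split of an unspaced name."""
    return [(name[:i], name[i:]) for i in range(1, len(name))]


def romanized_candidates(name, readings=TOY_READINGS):
    """Romanize each split whose halves are both known, in both name orders."""
    results = []
    for family, given in candidate_splits(name):
        if family in readings and given in readings:
            fam, giv = readings[family], readings[given]
            results.append('%s %s' % (fam, giv))  # family name first
            results.append('%s %s' % (giv, fam))  # given name first
    return results


print(romanized_candidates(u'山田太郎'))
```

Each candidate could then be added to the query as another OR phrase, so a romaji-only document would still be found.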
Copied from https://github.com/google/personfinder/pull/282#issuecomment-245865103