machinalis / quepy

A python framework to transform natural language questions to queries in a database query language.

Unicode rears its ugly head #8

Open ketsuban opened 10 years ago

ketsuban commented 10 years ago

The query "What is Pokémon?" silently drops the é and then tells me there is no result in the database for "Pokmon". "What is Tyranitar?", on the other hand, got me this gem from Freebase:

are one of the 493 fictional species of Pokᅢᄅmon creatures from the multi-billion-dollar Pokᅢᄅmon media franchise, designed by Ken Sugimori. The purpose of Tyranitar in the games, anime, and manga, as with all other Pokᅢᄅmon, is to battle both wild Pokᅢᄅmon¬タヤuntamed creatures that characters encounter while embarking on various adventures and tamed Pokᅢᄅmon creatures owned by Pokᅢᄅmon trainer.

In case GitHub eats them, the two garbled characters in the middle of each "Pokémon" are U+FFC3 HALFWIDTH HANGUL LETTER AE and U+FFA9 HALFWIDTH HANGUL LETTER RIEUL. How U+00E9 LATIN SMALL LETTER E WITH ACUTE ended up as that is beyond me. (It may be a bug in the database, because the query "What is Pikachu?" gets back a correctly formatted page.)
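
One possible explanation for both symptoms, sketched in Python. The UTF-8 encoding of é is the two bytes 0xC3 0xA9, and those are exactly the low bytes of U+FFC3 and U+FFA9. That is what you would get if each UTF-8 byte were widened to a 16-bit character as a signed value somewhere along the way. The dropped é in "Pokmon" would be consistent with an ASCII conversion that ignores errors. Both are guesses about where the text gets mangled, not confirmed behaviour of quepy or Freebase, and the helper names below are made up for illustration.

```python
import unicodedata


def sign_extend_bytes(data: bytes) -> str:
    """Hypothetical reproduction: treat each byte as a signed 8-bit value
    widened to 16 bits, so bytes >= 0x80 land in the U+FF80..U+FFFF range."""
    return "".join(chr(0xFF00 | b) if b >= 0x80 else chr(b) for b in data)


def undo_sign_extension(text: str) -> str:
    """Reverse the mapping above. Only valid if every character is either
    ASCII or in U+FF80..U+FFFF, which is an assumption about the input."""
    return bytes(ord(ch) & 0xFF for ch in text).decode("utf-8")


original = "Pok\u00e9mon"

# Symptom 1: an ASCII conversion that ignores errors silently drops the é.
dropped = original.encode("ascii", "ignore").decode("ascii")
print(dropped)  # Pokmon

# Symptom 2: the UTF-8 bytes C3 A9 become U+FFC3 and U+FFA9.
garbled = sign_extend_bytes(original.encode("utf-8"))
for ch in garbled[3:5]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+FFC3 HALFWIDTH HANGUL LETTER AE
# U+FFA9 HALFWIDTH HANGUL LETTER RIEUL

# The same mapping applied to an em dash (UTF-8 bytes E2 80 94) gives
# U+FFE2 U+FF80 U+FF94, whose compatibility-normalized (NFKC) form is the
# not sign followed by the katakana TA and YA.
dash = sign_extend_bytes("\u2014".encode("utf-8"))
print(unicodedata.normalize("NFKC", dash))

# If the guess is right, the damage is reversible.
print(undo_sign_extension(garbled))  # Pokémon
```

If this is what is happening, the "¬タヤ" in the quoted text would be a mangled em dash, and the text stored in the database may itself be fine, with the corruption happening where the response bytes are turned into characters.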