josiah-wolf-oberholtzer / discograph

Social Graphing for the Discogs Database
MIT License

Search should be diacritic-insensitive. #59

Closed inostia closed 8 years ago

inostia commented 8 years ago

E.g., a search for "Thomas Koner" should return results for "Thomas Köner".

I've never contributed to an open source project but I could try to implement this.

josiah-wolf-oberholtzer commented 8 years ago

+100. This is a really good idea.

On a quick glance around Google, I'm not sure if this is really easy or really hard. Feel free to look around for solutions, or contribute one of your own.

FYI: I'm currently using Sqlite 3.8.11 as my storage engine in the deployment, with FTS3 & FTS4 enabled. My ORM is Peewee, and I'm using the "porter" tokenizer in Sqlite.

This article sounds helpful: http://www.swwritings.com/post/2013-05-04-diacritics-and-fts/

josiah-wolf-oberholtzer commented 8 years ago

I've done some experiments. Using the "unicode61" FTS tokenizer in Sqlite solves the diacritics problem.

However, the version of Sqlite compiled into out-of-the-box Python wasn't compiled with the appropriate flags to make this tokenizer work.

The solution? Don't use Python's built-in sqlite3 module, but APSW instead. APSW downloads and compiles in its own copy of sqlite3, independent of any system installation. It also lets you set all of the flags for fancy features, including the "unicode61" tokenizer.
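
For anyone following along, here is a minimal sketch of the tokenizer behaviour described above. It is not the project's actual code: the table name `artist_search`, the column `name`, and the helper functions are hypothetical, and the standard-library `sqlite3` module stands in for APSW, on the assumption that the SQLite library it links against was built with FTS4 and the "unicode61" tokenizer available. If that assumption does not hold, the `CREATE VIRTUAL TABLE` statement will fail, which is the situation described in the comment above.

```python
import sqlite3


def build_index(names, tokenizer="unicode61"):
    """Create an in-memory FTS4 table (hypothetical schema) and fill it with names."""
    conn = sqlite3.connect(":memory:")
    # Assumption: the linked SQLite build supports FTS4 and this tokenizer.
    conn.execute(
        f"CREATE VIRTUAL TABLE artist_search USING fts4(name, tokenize={tokenizer})"
    )
    conn.executemany(
        "INSERT INTO artist_search(name) VALUES (?)",
        [(name,) for name in names],
    )
    return conn


def search(conn, query):
    """Return the stored names whose tokens match every term in the query."""
    rows = conn.execute(
        "SELECT name FROM artist_search WHERE name MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


conn = build_index(["Thomas Köner", "Thomas Brinkmann"])
# The query is typed without the umlaut; unicode61 folds diacritics on both
# the indexed text and the query terms.
print(search(conn, "Thomas Koner"))
```

The same `CREATE VIRTUAL TABLE` statement should work unchanged through an APSW connection, since the tokenizer is selected in SQL rather than in the Python binding.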

inostia commented 8 years ago

Great! The blog post you pointed to seems pretty straightforward. Thanks for this; the project's awesome.

josiah-wolf-oberholtzer commented 8 years ago

Fixed. Updated DB will go live later today.