internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Make author name matching case insensitive #9390

Open scottbarnes opened 1 month ago

scottbarnes commented 1 month ago

Related: #9003, internetarchive/infogami#221

Problem

A clear and concise description of what you want to happen

On import, author name matching should be case insensitive.

Additional Context

internetarchive/infogami#217 changed ~ to use ILIKE rather than LIKE, and the Open Library code in #9003 relied upon this to perform case insensitive author name matching on import.

However, the Infogami ILIKE change caused performance issues and is slated to be reverted in internetarchive/infogami#221, with ~ doing a LIKE operation and ~i doing an ILIKE operation.

Once internetarchive/infogami#221 is merged, author name resolution will be case sensitive again. We can't simply update the Open Library code in openlibrary/catalog/add_book/load_book.py to use ~i, however, because of the performance issues associated with the ILIKE query. We'll need to investigate further; perhaps EXPLAIN can tell us more about the query.
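To make the behavioral difference concrete, here is a minimal Python sketch of LIKE-style versus ILIKE-style matching. It is not the Infogami or Open Library implementation: the helper names `like_to_regex` and `matches` are hypothetical, and the translation handles only the `%` and `_` wildcards (no escape characters).

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex (hypothetical helper).

    Only the % and _ wildcards are handled; LIKE escape characters are ignored.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def matches(name: str, pattern: str, case_insensitive: bool = False) -> bool:
    """Return True if name matches pattern, LIKE-style or ILIKE-style."""
    flags = re.DOTALL
    if case_insensitive:
        flags |= re.IGNORECASE
    return re.fullmatch(like_to_regex(pattern), name, flags) is not None


stored = "Mark Twain"     # name already in the database (made-up example)
incoming = "MARK TWAIN"   # name on an incoming import record (made-up example)

print(matches(stored, incoming))                         # LIKE-style: False
print(matches(stored, incoming, case_insensitive=True))  # ILIKE-style: True
```

With case-sensitive matching, the incoming record above would not resolve to the existing author, which is the situation this issue wants to avoid.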

Proposal & Constraints

What is the proposed solution / implementation?

None yet -- this will take more investigation to figure out why ILIKE caused such significant performance issues.

Leads

Related files

Stakeholders

Note: Before making a new branch or updating an existing one, please ensure your branch is up to date.

tfmorris commented 1 month ago

Doesn't Solr already do this? Is there more context available about why this needs to be done in PostgreSQL in this particular use case?

A few general comments:

cdrini commented 1 month ago

Solr might be what we have to do, considering the performance issues with ILIKE. Note that Solr has the caveat of being 1 minute behind live edits. In the past, when Solr was used to dedupe imports, this lag caused edge cases where related books imported in quick succession produced dupes, so we'd always need a Postgres backup check of some sort. The Postgres ILIKE was hence a mandatory and simple change that would result in a large reduction in duplicate authors being created. The plan was to add the Solr checking as an improvement at some point in the future. But we might have to re-evaluate that strategy, as mentioned above.

Oh sweet, thanks for that trigram index find! When we investigate, we'll see what it's currently using.
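The "Solr first, Postgres as a backup check" idea described above could be sketched roughly as follows. This is an illustration under assumptions, not Open Library code: `find_author_key` and `make_lookup` are hypothetical names, the two lookups are stand-in dictionaries rather than real Solr or Postgres clients, and the author keys are made up.

```python
from typing import Callable, Dict, Optional

# A lookup takes an author name and returns an author key, or None if not found.
Lookup = Callable[[str], Optional[str]]


def make_lookup(records: Dict[str, str]) -> Lookup:
    """Build a case-insensitive exact-match lookup over in-memory records."""
    folded = {name.casefold(): key for name, key in records.items()}
    return lambda name: folded.get(name.casefold())


def find_author_key(name: str, search_lookup: Lookup, db_lookup: Lookup) -> Optional[str]:
    """Try the (possibly stale) search index first, then fall back to the database."""
    key = search_lookup(name)
    if key is not None:
        return key
    # The index may lag behind live edits, so confirm against the database
    # before concluding that the author does not exist.
    return db_lookup(name)


# The database already holds an author that the stale index has not picked up yet.
database = make_lookup({"Mark Twain": "/authors/OL1A", "Jane Austen": "/authors/OL2A"})
stale_index = make_lookup({"Mark Twain": "/authors/OL1A"})

print(find_author_key("MARK TWAIN", stale_index, database))   # found via the index
print(find_author_key("jane austen", stale_index, database))  # found via the fallback
print(find_author_key("Unknown Author", stale_index, database))  # None: a new author
```

In this sketch the fallback is what prevents a duplicate when two related records arrive within the index's lag window; how to make that database check both case insensitive and fast is the open question in this issue.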