internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add solr support for synonyms for numbers/abbreviations #6635

Open popcar2 opened 2 years ago

popcar2 commented 2 years ago

I've been using the website for a long time now and one of my biggest gripes is how searching works. When searching for books in Open Library, you often need to write exactly the correct title. This means that if a book uses words for numbers (One, Two, Three, etc.), searching the same title with digits (1, 2, 3, etc.) gives no results.

Another example is if a book uses "Vol." in the title, searching "volume" would net no result even though they mean the same thing. This makes finding specific books a lot more difficult.

Describe the problem that you'd like solved

The search engine searches exact terms, but it should have tolerance when dealing with numbers or words of equivalent meaning. Here's an example:

I would like searching "The Walking Dead Compendium Four" and "The Walking Dead Compendium 4" to find the book.

Proposal & Constraints

The search engine should be tolerant of words that have the same meaning:

"Vol." should be the same as writing "Volume".
"Two" should be the same as writing "2" or "II".
"&" and "and" should also be interchangeable.

Additional context

Another example, but with "vol" and "volume".

cdrini commented 2 years ago

I think the solution for this would be to make use of solr's synonyms feature. But some experimenting / investigation needed. Anyone who has some time to experiment with adding synonyms to solr, please do!

bicolino34 commented 2 years ago

@cdrini I would like to do it, how can I?

bicolino34 commented 2 years ago

The search is strict not only with terms, but also with letters. Compare Безпека життєдіяльності and Безпека життєдіяльност. With just one letter missing (і), there are no results.

cdrini commented 2 years ago

So this is a solr research task; here are some of the places where modifications will be needed:

The solr schema, which defines the various types of text fields, has synonyms enabled -- but only at query time:

https://github.com/internetarchive/openlibrary/blob/82bc2f61c8c41363567d398b7b027a16775dbc91/conf/solr/conf/managed-schema#L426-L467

This blog post has some info: https://library.brown.edu/create/digitaltechnologies/using-synonyms-in-solr/

In a nutshell we need synonyms inside https://github.com/internetarchive/openlibrary/blob/ccabd95be2a82c4f79d94b1f10e46ea1d3c5c730/conf/solr/conf/synonyms.txt

And then test locally with a full reindex (See https://github.com/internetarchive/openlibrary/wiki/Solr#making-changes-to-solr-config )

But for numbers, they probably need to be in English only for now? I'm not sure how we should handle non-English numbers. Ideally we'd want different synonyms files for different user locales, but I'm not sure if/how to do this in solr.
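For the local testing step above, one way to check whether a synonym change helped is to compare hit counts for two phrasings of the same query. This is only a rough sketch: the Solr URL and core name below are assumptions and need adjusting to the local setup, and `build_select_url`, `hit_count` and `compare` are hypothetical helper names, not part of the Open Library codebase.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr core is reachable here; change to match your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/openlibrary/select"


def build_select_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks only for the hit count (rows=0)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url}?{urlencode(params)}"


def hit_count(query, base_url=SOLR_SELECT_URL):
    """Return numFound for a query; requires a running Solr instance."""
    with urlopen(build_select_url(query, base_url)) as resp:
        return json.load(resp)["response"]["numFound"]


def compare(query_a, query_b, base_url=SOLR_SELECT_URL):
    """Return the hit counts of two queries that should be equivalent."""
    return hit_count(query_a, base_url), hit_count(query_b, base_url)
```

Running `compare("The Walking Dead Compendium Four", "The Walking Dead Compendium 4")` before and after a reindex would show whether the two phrasings now match a similar number of documents.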

cdrini commented 2 years ago

But we can definitely add something like vol,vol.,Volume in there and see if it helps with that!
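To make the idea concrete, here is a small sketch of how comma-separated equivalence lines like the one above could be read and applied to a query. It only mimics the comma-separated form of a synonyms file; it is not how solr does it internally, and the sample entries other than the `vol` line are hypothetical. A real analyzer may also strip punctuation such as "." or "&" before the synonym filter ever sees it, so those entries would need testing against the actual field type.

```python
# Hypothetical sample entries in the comma-separated synonyms format.
SAMPLE_SYNONYMS = """
# abbreviations
vol,vol.,Volume
# numbers (English only)
2,two,ii
&,and
"""


def parse_synonyms(text):
    """Map each lowercased term to the set of terms it is equivalent to."""
    groups = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            continue  # explicit mappings are out of scope for this sketch
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in terms:
            groups.setdefault(term, set()).update(terms)
    return groups


def expand_query(query, groups):
    """Return, for each whitespace-separated token, its sorted equivalents."""
    return [sorted(groups.get(tok, {tok})) for tok in query.lower().split()]


if __name__ == "__main__":
    print(expand_query("Compendium Vol. 2", parse_synonyms(SAMPLE_SYNONYMS)))
```

Lowercasing every term is an assumption made here for simplicity; whether case is ignored in practice depends on how the synonym filter is configured in the schema.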

cdrini commented 2 years ago

Actually it looks like the synonyms file is working! You can see the TV one in action here: https://openlibrary.org/search?q=television+kid&mode=everything .

So adding volume should be easy enough!

cdrini commented 1 year ago

@bicolino34 Your issue would probably be handled by solr's spell checking features, i.e. showing something like "Did you mean?" when a user's query is close to, but not exactly, a correct term. Would you mind creating a separate issue to add support for "Did you mean?" ? That'll require a different approach on the solr side, but would help users a ton!
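As a rough illustration of the "Did you mean?" idea, the sketch below suggests the closest known term for a misspelled one using Python's standard `difflib`. This is not solr's spell checker, just a stand-in to show the behaviour; the `did_you_mean` helper, the tiny vocabulary and the 0.8 cutoff are all assumptions.

```python
from difflib import get_close_matches


def did_you_mean(term, vocabulary, cutoff=0.8):
    """Return the closest known term, or None if the term is known or nothing is close."""
    if term in vocabulary:
        return None  # exact match, no suggestion needed
    matches = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical vocabulary; a real one would come from the search index.
    known_terms = ["безпека", "життєдіяльності"]
    print(did_you_mean("життєдіяльност", known_terms))
```

With the example from this thread, the query missing its final letter is close enough to the indexed term to produce a suggestion instead of an empty result page.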

mekarpeles commented 1 month ago

Is this addressed by #6922? Can this issue be consolidated into that one?

tfmorris commented 1 month ago

#6922 is the PR which is meant to address this issue, but it's stuck in review and has some issues.

cdrini commented 1 month ago

Yep; if someone with solr knowledge has the time to take that branch and clean it up and test it, that would be a huge help!