opentexts / opentexts

Register of open digitised texts
MIT License

Search field prioritisation #153

Open brizee opened 3 years ago

brizee commented 3 years ago

Some related user feedback here:

Specific searches threw up more than expected, maybe a misunderstanding of the search engine? ‘George Orwell’ + 1984 provides many more results than expected. A one-word search, e.g. Orwell, yields a mix of things to do with the place Orwell, books by the author George Orwell, and books where Orwell is mentioned. When you add George to search for George Orwell, you don’t get all George + Orwell items first; it just seems a haphazard mix, with Orwell-only results mixed in with George + Orwell. Could it be either an option or just a hard-wired practice to list first the results where Orwell is the author, then where Orwell occurs in the title of a book, then where Orwell is mentioned in a book? Where one has George + Orwell, surely the combination of both should be listed first.

I suspect people are going to try Google-style searches here (note: I have no idea what these are or how to use them, so they're probably a pro-level feature!) so it would be good to follow that model whenever we implement this. We'll also want to add instructions to the help page (#7), which I am happy to help with because if I can understand it, probably anyone can. 😜

Originally posted by @sarahmonster in https://github.com/opentexts/opentexts/issues/90#issuecomment-694764499

Part of this feedback is that certain fields should be weighted more heavily in results - it looks like this IS possible in Solr - are we already doing this and/or should we consider it? :smile:

https://stackoverflow.com/questions/16404228/solr-high-priority-in-fields

stuartlewis commented 3 years ago

Right now we have an out-of-the-box Solr configuration, so nothing is weighted at all. Definitely something we can look into, I guess for title and creator mainly?

brizee commented 3 years ago

> Would it be either an option or just a hard-wired practice to list first Orwell where Orwell is the author, then where Orwell occurs in the title of a book, then Orwell when mentioned in a book. Where one has George + Orwell surely the combination of both should be listed first.

Seems to chime with the feedback, yeah. Probably something we keep under review with analytics. I wonder if it's possible to detect how often search terms appeared in each field? Maybe count the highlights, though that might be too broad a measure. A manual review of query types would probably suffice?
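As a rough sketch of the "count the highlights" idea: Solr's JSON response includes a `highlighting` section keyed by document id, then by field name, with a list of snippets for each field. Assuming highlighting is enabled for the fields of interest, a helper along these lines could tally which fields are actually matching. The function name and the sample data are hypothetical, not taken from the OpenTexts code or index.

```php
<?php
// Count, per field, how many result documents had at least one highlight snippet.
// $highlighting mirrors the "highlighting" section of a Solr JSON response:
// document id => field name => list of snippets.
function countHighlightedFields(array $highlighting): array
{
    $counts = [];
    foreach ($highlighting as $fields) {
        foreach ($fields as $field => $snippets) {
            if (count($snippets) > 0) {
                $counts[$field] = ($counts[$field] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequently matched field first
    return $counts;
}

// Hypothetical sample data, not real OpenTexts results.
$sample = [
    'doc1' => ['creator' => ['<em>Orwell</em>, George'], 'title' => []],
    'doc2' => ['title' => ['The River <em>Orwell</em>']],
    'doc3' => ['creator' => ['<em>Orwell</em>, George'], 'description' => ['essay on <em>Orwell</em>']],
];
print_r(countHighlightedFields($sample));
```

This only says which fields matched, not whether the match was what the user wanted, so it would complement a manual review rather than replace it.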

sarahmonster commented 3 years ago

Having finally gotten around to reading the Help page (🙄), I actually think a key issue here is that we're defaulting to OR searching (i.e., returning results that match any of the words in the search query), whereas users expect an AND search.

This is how Google operates (https://support.google.com/websearch/answer/2466433?hl=en-PT&ref_topic=3081620) and I'd recommend that we follow that pattern—instead of having an explicit AND query, we should have an explicit OR query instead. This would better match user expectations and established patterns, as well as returning more relevant result sets by default.

brizee commented 3 years ago

I'm pretty convinced that isn't what Google does, actually. I believe that, like us, Google returns results that contain either term but prefers results that contain both.

You'll quite often see Google results with a line informing you of this for more complex searches, along with the option to make the term required: [image]

Making it required just adjusts the search to include the term in quotes. [image]

For simple searches this doesn't matter of course - their index is large enough that they'll easily fill many pages with results containing both Fruit and Juice before moving on to results containing only one or the other.
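To make the "match either, prefer both" behaviour concrete, here is a toy sketch. It is an illustration only: it is not how Google or Solr actually score results, and the function name and sample documents are made up for the example. A document is kept if it contains any query term, and documents containing more of the distinct terms sort first.

```php
<?php
// Toy illustration only: real search engines use far richer scoring.
// A document matches if it contains ANY query term (OR semantics), and
// documents containing more of the distinct query terms are ranked first.
function rankByTermsMatched(array $docs, string $query): array
{
    $terms = array_unique(
        preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY)
    );
    $scored = [];
    foreach ($docs as $id => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // Number of distinct query terms that occur in this document.
        $matched = count(array_intersect($terms, $words));
        if ($matched > 0) {
            $scored[$id] = $matched;
        }
    }
    arsort($scored); // highest number of matched terms first
    return array_keys($scored);
}

// Hypothetical documents for the Fruit + Juice example.
$docs = [
    'a' => 'Fruit trees of Kent',
    'b' => 'Fruit and juice recipes',
    'c' => 'A history of paper',
];
print_r(rankByTermsMatched($docs, 'Fruit Juice'));
```

Here document `b` (both terms) ranks ahead of `a` (one term), and `c` is dropped because it matches neither.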

stuartlewis commented 3 years ago

Weighted search is reasonably easy to add in Search.php:


```php
// Use dismax so that individual fields can be weighted
$dismax = $query->getDisMax();
$dismax->setQueryFields('title^5 creator^5 year^5 publisher^5 placeOfPublication^2 description topic');
```
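For reference, the weights above end up in Solr's `qf` (query fields) request parameter. Dismax also has a `pf` (phrase fields) parameter, which boosts documents where all of the query terms appear close together in a field; that may help with the "George + Orwell should be listed first" feedback. The following is a minimal sketch in plain PHP that builds those raw request parameters without a client library. The helper name and the specific weights (creator above title above description, following the ordering suggested in the user feedback) are assumptions for illustration, not a tested configuration.

```php
<?php
// Build raw Solr request parameters for a weighted dismax search.
// defType, q, qf and pf are standard Solr dismax parameters; the helper
// itself and the weights passed to it below are hypothetical.
function buildDismaxParams(string $userQuery, array $fieldWeights, array $phraseWeights = []): array
{
    $format = function (array $weights): string {
        $parts = [];
        foreach ($weights as $field => $boost) {
            // A boost of 1 is the default, so the "^1" suffix is omitted.
            $parts[] = $boost == 1 ? $field : $field . '^' . $boost;
        }
        return implode(' ', $parts);
    };

    $params = [
        'defType' => 'dismax',
        'q'       => $userQuery,
        'qf'      => $format($fieldWeights),
    ];
    if ($phraseWeights) {
        // pf boosts documents where the query terms appear in close proximity.
        $params['pf'] = $format($phraseWeights);
    }
    return $params;
}

$params = buildDismaxParams(
    'George Orwell',
    ['creator' => 10, 'title' => 5, 'description' => 1, 'topic' => 1],
    ['creator' => 20, 'title' => 10]
);
echo http_build_query($params), "\n";
```

Whether phrase boosting on `creator` works well depends on how names are stored in the index (for example "Orwell, George"), so the actual weights would need checking against real queries.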