gsoules / AvantSearch

Omeka Classic plugin that provides extended search results capabilities for the public interface.

Stop words or 'do' in search result in no matches #5

Open DigExpCon opened 3 years ago

DigExpCon commented 3 years ago

For KEYWORD searches, we are getting some odd results with AvantSearch. Including stop words in the search string seems to invalidate the rest of the search, and there also seems to be an issue with case sensitivity.

For example, take the exact title of this item and use it as a keyword search: https://cpw.cvlcollections.org/items/show/210

The full title results in no hits: https://cpw.cvlcollections.org/find?query=The+future+of+wildlife+conservation+funding%3A+What+options+do+U.S.+college+students+support%3F

A shortened version of the title results in no hits: https://cpw.cvlcollections.org/find?query=The+future+of+wildlife+conservation+funding

A shortened version with a lower-case 'the' does work: https://cpw.cvlcollections.org/find?query=the+future+of+wildlife+conservation+funding

A longer version that includes the word 'do' does not work (and 'do' is not in the list of InnoDB stop words): https://cpw.cvlcollections.org/find?query=the+future+of+wildlife+conservation+funding%3A+what+options+do+U.S.+college+students+support%3F

A longer version without the word 'do' does work: https://cpw.cvlcollections.org/find?query=the+future+of+wildlife+conservation+funding%3A+what+options+U.S.+college+students+support%3F

So, if the stop words 'The', 'What', or 'Of' are capitalized in the search string, the search finds no results. Likewise, if the word 'do' is in the search string (whether capitalized or not), no results are found.

The DB table search_texts uses InnoDB, and its collation is utf8_unicode_ci, so comparisons should be case-insensitive.

The good news is that if you search for the full title, precisely as it appears in the record, as a Title search (in Advanced Search), the article comes up: https://cpw.cvlcollections.org/find?advanced%5B0%5D%5Bjoiner%5D=and&advanced%5B0%5D%5Belement_id%5D=50&advanced%5B0%5D%5Btype%5D=contains&advanced%5B0%5D%5Bterms%5D=The+future+of+wildlife+conservation+funding%3A+What+options+do+U.S.+college+students+support%3F&layout=1

DigExpCon commented 3 years ago

In models/SearchQueryBuilder.php I made two changes: on line 5 I set const MIN_KEYWORD_LENGTH = 2 instead of 3, and on line 166 I added $query = strtolower($query); to convert the query to lower case before the search is executed. Together these seem to have resolved the oddities described above.
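For anyone who wants to see the effect of those two changes in isolation, here is a minimal standalone sketch. It is not the plugin's code: the function name normalizeKeywordQuery and the word-splitting rule are hypothetical, and it assumes that MIN_KEYWORD_LENGTH is used to drop words shorter than the minimum. Only the constant's value and the strtolower call come from the comment above.

```php
<?php
// Stand-in for the plugin constant; the comment above changed it from 3 to 2.
const MIN_KEYWORD_LENGTH = 2;

/**
 * Hypothetical helper (not part of AvantSearch): lowercases a keyword query
 * and drops words shorter than MIN_KEYWORD_LENGTH.
 * ASSUMPTION: the plugin may use the constant differently.
 */
function normalizeKeywordQuery(string $query): array
{
    // Lowercase first, so 'The' and 'the' are treated identically.
    $query = strtolower($query);

    // Split on anything that is not an ASCII letter or digit.
    // This is a simplification chosen for the sketch.
    $words = preg_split('/[^a-z0-9]+/', $query, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only words that meet the minimum length.
    $kept = array_filter($words, function (string $word): bool {
        return strlen($word) >= MIN_KEYWORD_LENGTH;
    });

    // Re-index so the result is a plain list.
    return array_values($kept);
}

$words = normalizeKeywordQuery('The future of wildlife conservation funding: What options do U.S. college students support?');
echo implode(' ', $words), PHP_EOL;
```

With the minimum set to 2, 'do' and 'of' survive the filter and the capitalized 'The' and 'What' are lowercased. With the original value of 3, this sketch would drop 'do' and 'of' instead.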