notanumber / xapian-haystack

A Xapian backend for Haystack
GNU General Public License v2.0

Issue when using `auto_query()` with stemming #146

Open ecstaticpeon opened 9 years ago

ecstaticpeon commented 9 years ago

I have an index with some items containing the word "voyage" and others containing "voyager". When searching for "voyage" using auto_query(), the backend returns the items containing "voyager" first, although one would expect the items with "voyage" to come first. However, when using filter(), the ordering appears correct (i.e. first "voyage", then "voyager").

After doing some investigation, it looks like the query returned by XapianSearchQuery.build_query() is different depending on whether auto_query() or filter() is used:

```python
from haystack.query import SearchQuerySet

search_query_set = SearchQuerySet()

search_query_set.auto_query(u'voyage')
# build_query() returns `Xapian::Query(Zvoyag:(pos=1))`
# Results are "voyager" first, then "voyage".

search_query_set.filter(content=u'voyage')
# build_query() returns `Xapian::Query((Zvoyag OR voyage))`
# Results are "voyage" first, then "voyager".
```

Looking at XapianSearchQuery._filter_contains(), which is called when using filter(), the docstring specifies that the search is done on both the stemmed and un-stemmed terms: "Splits the sentence in terms and join them with OR, using stemmed and un-stemmed."
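As a rough illustration of what that docstring describes (not the backend's actual code), the sketch below builds a textual OR of the stemmed and un-stemmed form of each term. `toy_stem` and `contains_query` are hypothetical names, and the suffix-stripping stemmer is only a stand-in for a real stemmer, chosen so that "voyage" and "voyager" both reduce to "voyag".

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 'er' or 'e'."""
    word = word.lower()
    for suffix in ("er", "e"):
        # Only strip when a reasonably long stem is left over.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def contains_query(sentence):
    """Describe a query that ORs the stemmed (Z-prefixed) and un-stemmed form of each term."""
    parts = []
    for term in sentence.split():
        parts.append("Z" + toy_stem(term))  # stemmed form, marked with a Z prefix
        parts.append(term.lower())          # un-stemmed form
    return "(" + " OR ".join(parts) + ")"


print(contains_query("voyage"))
print(contains_query("voyager"))
```

With this toy stemmer, both words share the stemmed part of the query and differ only in the un-stemmed part, which is what lets an exact match be told apart from a stem-only match.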

Shouldn't using auto_query() end up using both stemmed and un-stemmed terms as well?

Versions used:

Xapian: 1.3.2

xapian-haystack: 3e8611265ec63522d4e3d81b45de3866f48853ee (from 12 January 2015).

ecstaticpeon commented 9 years ago

Ignore the ordering issue; it is actually related to our index. The question remains, though: shouldn't using auto_query() end up using both stemmed and un-stemmed terms as well?

jorgecarleitao commented 9 years ago

Thanks for using Xapian-Haystack and for reporting this here.

In principle I agree with the consistency you mentioned. However, I'm not sure this is what we want since the auto_query receives a query, not a term. E.g. what would be the stemmed version of Hello OR bye OR che*rs?
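To make the concern concrete, here is a small sketch that labels each whitespace-separated token of such a query as an operator, a wildcard pattern, or a plain term. The `classify` helper and the `OPERATORS` set are assumptions made for illustration; this is not how Haystack or Xapian parse queries.

```python
OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set, for illustration only


def classify(query):
    """Label each whitespace-separated token of a query string."""
    kinds = []
    for token in query.split():
        if token in OPERATORS:
            kinds.append((token, "operator"))
        elif "*" in token:
            kinds.append((token, "wildcard"))
        else:
            kinds.append((token, "term"))
    return kinds


for token, kind in classify("Hello OR bye OR che*rs"):
    print(token, kind)
```

Only the tokens labelled "term" have an obvious stemmed form; the operators and the wildcard pattern do not, which is why stemming a whole query is less clear-cut than stemming a single term.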

See here what keywords it accepts.

ecstaticpeon commented 9 years ago

Thanks for making Xapian-Haystack :)

As far as I understand, the query will be split into terms? And stemming will therefore be applied to each of the terms where applicable?

jorgecarleitao commented 9 years ago

I'm not sure the query is split in terms by Xapian-Haystack. In this line, the "term" is prepared by haystack and sent to the backend to be interpreted (self.backend.parse_query(query)). We just add the field_name:%s to the term in case it is made on a specific field.
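A minimal sketch of that prefixing step, assuming a hypothetical helper named `prefix_with_field` (not a function in Xapian-Haystack) and treating `field_name=None` as "no specific field":

```python
def prefix_with_field(query, field_name=None):
    """Return the query string, prefixed with 'field_name:' when a field is given."""
    if field_name is None:
        # No specific field: the query string is passed through untouched.
        return query
    return "%s:%s" % (field_name, query)


print(prefix_with_field("voyage"))
print(prefix_with_field("voyage", "title"))
```

In this sketch the query string is never split; it is passed on as a whole, with or without the field prefix.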

Can you point out where in the code it is split by terms?