notanumber / xapian-haystack

A Xapian backend for Haystack
GNU General Public License v2.0

[Support Request] Case insensitive search? #154

Open coredumperror opened 9 years ago

coredumperror commented 9 years ago

Can the Xapian backend offer case insensitive search? I just switched to Xapian from Whoosh, and without changing any of my code, my searches became case sensitive. I would like to have case insensitive searches, but I have no idea how to turn off case sensitivity.

coredumperror commented 9 years ago

Upon further experimentation, this appears to be related to whether or not I use AutoQuery.

This SQS (with query='email only'):

SearchQuerySet().filter(content=query).load_all()

finds all the results with either "email" or "only" in their indexes, and it's not case sensitive. However, setting query='"Email only"' (quoted search phrase) gives no results, even though there is a model with that exact phrase, including case, in the index.

But this SQS (with query='email only'):

SearchQuerySet().filter(content=AutoQuery(query)).load_all()

Gives no results at all. Setting query='"Email only"' (quoted search phrase) gets the one result with that exact string, including case, appearing in its index. And yet query='Email only' (no quotes) again gives no results.

What am I doing wrong, here? These results don't seem to make much sense.

jorgecarleitao commented 9 years ago

Thanks for bringing this here.

Just to try to understand, you would like to have a way to use filter(...) such that it returns documents with an exact match of two words (e.g. Email only), but you want this to be case insensitive.

What you are observing is that 1) content never gives exact matches (that is expected from the code), and 2) AutoQuery gives inconsistent results with cases (which seems to be in line with this SO question and this thread).

Is this correct?

The query .filter(content='"Email only"') should not give any good results since it interprets the quotes as part of the string (they are supposed to be used when the query is fed to AutoQuery). Currently there is no way to do exact matches in the field "content", but it could be implemented. I think what you want is a case-insensitive "PHRASE" search ("Email PHRASE only"). However, one would need to test that case-insensitive exact searches work; I'm not sure right now.

The AutoQuery behaviour would be an issue in Xapian itself. I think we could create a minimal example, confirm the issue in this backend, and use it to report to Xapian.

In any case, I agree that this backend is not so well tested when it comes to case sensitivity. I will see if I get time to improve this.

Thanks for your time and effort in getting things together here.

coredumperror commented 9 years ago

Ah yes, I'd forgotten that using .filter(content='"Email only"') doesn't make sense. I was just throwing stuff at the wall to see what stuck.

The real problem is that both of these queries give zero results, when they absolutely shouldn't:

SearchQuerySet().filter(content=AutoQuery('Email only')).load_all()
SearchQuerySet().filter(content=AutoQuery('email')).load_all()

There are objects in my search index which exactly match "Email only", but they don't appear. There are also objects which exactly match "email", but they also don't appear.

Bizarrely enough, this query returns the results I would have expected from both of the other two:

SearchQuerySet().filter(content=AutoQuery('Email')).load_all()

So I'm thinking this isn't actually a case sensitivity problem. Searching for "Email" is returning results which only match "email". Yet searching for "email" doesn't return those results. The same thing happens with "Invoice".

How the heck is that happening?? Is there any way to look at the contents of the Xapian index to see if it's somehow capitalizing words that it shouldn't be? Or maybe I'm writing my SearchIndex classes badly?

class ObservingRunIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)

    def get_model(self):
        return ObservingRun

Is that too little code to make it work properly with Xapian, perhaps?

jorgecarleitao commented 9 years ago

I agree that the results are odd.

I'm not entirely familiar with what the QueryParser of Xapian does exactly, but maybe the different results you are getting could be because of this:

If a query term is entered with a capitalised first letter, then it will be searched for unstemmed.

http://xapian.org/docs/queryparser.html
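
A toy model of that rule may help show how it could produce the results reported above. It is not Xapian's implementation: the `toy_stem`, `parse_term` and `search` helpers and the sample index are made up for illustration, and the sketch assumes the index holds only unstemmed terms, which has not been confirmed for this backend. The one detail taken from Xapian is that stemmed query terms carry a `Z` prefix.

```python
# Toy model of the QueryParser rule quoted above; not Xapian's real code.
# Assumption: the index stores only unstemmed, lowercased terms.

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def parse_term(word):
    """Map one query word to the index term this toy model would look up."""
    if word[:1].isupper():
        # Capitalised first letter: searched for unstemmed (lowercased here).
        return word.lower()
    # Otherwise the word is stemmed; stemmed terms are marked with 'Z'.
    return "Z" + toy_stem(word.lower())


def search(index_terms, query):
    """Return True when every query word maps to a term in the index."""
    return all(parse_term(w) in index_terms for w in query.split())


# Hypothetical index built without any stemmed ('Z') terms.
index_terms = {"email", "only", "invoice"}

print(search(index_terms, "Email"))       # looks up 'email'
print(search(index_terms, "email"))       # looks up 'Zemail'
print(search(index_terms, "Email only"))  # looks up 'email' and 'Zonly'
```

If the real index looks like this hypothetical one, only the capitalised single-word query would match, which is the pattern described for "Email" and "Invoice". Checking the indexed terms is the way to confirm or rule this out.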

Yes, I agree with you that the natural path is to see what was indexed in the first place. There is a way to check which terms are in the Xapian index, which we use in our test cases:

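# in tests/test_backend.py
import os
import subprocess


def get_terms(backend, *args):
    # Run Xapian's `delve` tool on the index and capture what it prints.
    result = subprocess.check_output(['delve'] + list(args) + [backend.path],
                                     env=os.environ.copy()).decode('utf-8')
    # Drop the header before ": " and split the remaining terms on spaces.
    result = result.split(": ")[1].strip()
    return result.split(" ")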
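
To show what the parsing in `get_terms` does without running `delve`, here is a small sketch applied to a made-up output line. The sample string and the `parse_delve_output` and `has_stemmed_terms` names are hypothetical; only the two split steps come from the helper above, and real output will depend on the index and the arguments passed.

```python
def parse_delve_output(raw):
    """Apply the same two splits as get_terms to one line of output."""
    # Drop the "...: " header, then split the remainder on single spaces.
    body = raw.split(": ")[1].strip()
    return body.split(" ")


def has_stemmed_terms(terms):
    """Check for 'Z'-prefixed terms, the marker Xapian uses for stemmed forms."""
    return any(t.startswith("Z") for t in terms)


# Hypothetical output line, for illustration only.
sample = "Term List for record #1: email invoice only\n"

terms = parse_delve_output(sample)
print(terms)
print(has_stemmed_terms(terms))
```

A check like `has_stemmed_terms` is one way to see whether the index contains stemmed forms at all, which is relevant if lowercase queries are being stemmed before lookup.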

At this moment, I think it may be worth having a minimal example (with e.g. just a couple of entries) where the problem is demonstrated, to compare what we get with what we would expect, so we can think of a solution.

Again, thanks a lot for taking the time.

pembo13 commented 5 years ago

Whatever happened with this?