mchaput / whoosh

Pure-Python full-text search library

Single character got removed in QueryParser #39

Closed davidshen84 closed 1 year ago

davidshen84 commented 1 year ago

Hi,

I have documents with literal numbers like 1, 123 and 456. I can perform searches like 123 OR 456 but can't do searches like 1 OR 123.

from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import SimpleParser
parser = SimpleParser('content', Schema(id=ID(stored=True), path=ID(stored=True), content=TEXT()))
parser.parse('123 AND 1')

I found that the literal 1 is removed from the parsed query. If I use 01, the term is preserved by the query parser, but I index 1, not 01.

I experimented with other single characters, such as a, and they are all removed. I could not find any documentation mentioning this.

ccomb commented 1 year ago

I lost time trying to understand the same issue while searching for things like "pig system 3", and I found this: https://stackoverflow.com/questions/28010087/python-whoosh-not-accepting-single-character

However, I'm wondering what it would imply to default minsize to 1.

davidshen84 commented 1 year ago

I have not read the code, but I think minsize means the minimum length of a token. A token is removed if its length is less than minsize, regardless of whether it is listed in the stoplist argument.
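
To make that reading of minsize concrete, here is a minimal pure-Python sketch of a length-plus-stoplist filter. It is a simplified stand-in, not Whoosh's actual implementation: the function name `filter_tokens` and its signature are hypothetical and only mirror the behavior described above.

```python
# Simplified stand-in for a stop filter; NOT Whoosh's actual code.
# `filter_tokens` is a hypothetical helper used only for illustration.
def filter_tokens(tokens, stoplist=frozenset(), minsize=2):
    """Drop tokens shorter than minsize or present in stoplist."""
    return [t for t in tokens if len(t) >= minsize and t not in stoplist]


query_terms = "123 AND 1".split()

# With a cutoff of 2, the single-character term disappears.
print(filter_tokens(query_terms, minsize=2))  # ['123', 'AND']

# With a cutoff of 1, it survives.
print(filter_tokens(query_terms, minsize=1))  # ['123', 'AND', '1']
```

In Whoosh itself, the matching setting appears to be the minsize argument of StandardAnalyzer (default 2), so something like TEXT(analyzer=StandardAnalyzer(minsize=1)) in the schema should keep terms such as 1. This is untested here. Note that the default stoplist also contains short words such as a, so those would still be dropped unless the stoplist is changed too.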

davidshen84 commented 1 year ago

The SO post looks promising. However, I don't have time to test it. I will close this issue for now.