rdmenezes / thrudb

Automatically exported from code.google.com/p/thrudb

Queries for upper-case data in keyword fields never match #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Index a KEYWORD field with upper-case data, like 'Hello'
2. Query Thrudex for 'Hello'
3. Nothing returned.

I think the query analyser is downcasing all queries. I created two
indexes, kw and txt. kw indexes as KEYWORD fields, txt as TEXT fields. Now
look what happens:

>>> res, cnt = a.query_index('foo:"Hello"', bucket='kw'); print cnt
0
>>> res, cnt = a.query_index('foo:"Hello"', bucket='txt'); print cnt
1

I found this in the CLucene docs:
"Also be careful with Fields that are not tokenized (like Keywords). During
indexation, the Analyzer won't be called for these fields, but for a
search, the QueryParser can't know this and will pass all search strings
through the selected Analyzer. Usually searches for Keywords are
constructed in code, but during development it can be handy to use general
purpose tools (e.g. Luke) to examine your index. Those tools won't know
which fields are tokenized either. In the contrib/analyzers area there's a
KeywordTokenizer with an example KeywordAnalyzer for cases like this."

http://clucene.wiki.sourceforge.net/Official_CLucene_FAQ?f=print

So I'm guessing that the analyzer/parser Thrudex uses is downcasing
everything (and possibly stemming), which means that a query for
"Hello" becomes "hello" regardless of whether accountid is indexed as a TEXT
or KEYWORD field. And if it's a KEYWORD field, the match fails, because the
indexed term is still the untouched "Hello".
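
To make the guess concrete, here is a toy model of the suspected behaviour. It is not Thrudex or CLucene code: `analyze`, `ToyIndex` and the `tokenized` flag are made-up names, and the lowercasing analyzer is only my assumption about what happens on the query side.

```python
# Toy model of the suspected behaviour; not Thrudex or CLucene code.

def analyze(text):
    """Stand-in for a lowercasing analyzer (assumed behaviour)."""
    return text.lower().split()


class ToyIndex(object):
    def __init__(self):
        self.postings = {}  # (field, term) -> set of doc ids

    def add(self, doc_id, field, value, tokenized):
        # TEXT fields go through the analyzer; KEYWORD fields are stored verbatim.
        terms = analyze(value) if tokenized else [value]
        for term in terms:
            self.postings.setdefault((field, term), set()).add(doc_id)

    def query(self, field, text):
        # The query side cannot tell field types apart, so it always analyzes.
        hits = None
        for term in analyze(text):
            docs = self.postings.get((field, term), set())
            hits = docs if hits is None else hits & docs
        return sorted(hits or [])


kw = ToyIndex()
kw.add(1, 'foo', 'Hello', tokenized=False)   # KEYWORD-style field
txt = ToyIndex()
txt.add(1, 'foo', 'Hello', tokenized=True)   # TEXT-style field

print(len(kw.query('foo', 'Hello')))   # 0
print(len(txt.query('foo', 'Hello')))  # 1
```

The counts line up with the kw and txt results above, which is at least consistent with the downcasing theory.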

Workaround:
For now I can convert all data intended for KEYWORD fields to a base-36
alphabet, but dayyym. We should put a big note in the docs about this,
and/or provide a way to "construct searches for Keywords in code".
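
A rough sketch of that base-36 workaround, in case anyone else needs it. `to_base36` and `from_base36` are hypothetical helpers, not part of Thrudex, and the sketch uses Python 3 (`int.from_bytes`). The leading `k` is an extra assumption on my part: it keeps the token from starting with a digit, which is reported further down this thread as also failing to match.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols, no upper case


def to_base36(value):
    """Encode a text value as a lower-case-only token (hypothetical helper)."""
    data = value.encode('utf-8')
    # Prepend a 0x01 byte so leading zero bytes survive the integer round trip.
    n = int.from_bytes(b'\x01' + data, 'big')
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    # Letter prefix so the token never starts with a digit (assumption).
    return 'k' + ''.join(reversed(digits))


def from_base36(token):
    """Invert to_base36: drop the prefix, decode, strip the 0x01 marker byte."""
    n = int(token[1:], 36)
    raw = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return raw[1:].decode('utf-8')
```

The same encoding has to be applied both when indexing and when building the query string. Since the token contains no upper-case characters, a downcasing analyzer should leave it unchanged.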

Original issue reported on code.google.com by aris...@gmail.com on 2 Feb 2009 at 7:27

GoogleCodeExporter commented 9 years ago
KEYWORD fields also fail to match if the first character is a digit. :(

(indexed an object: {'bar':'z1234', 'baz':'1234z'} )

>>> res, cnt = a.query_index('bar:z1234', bucket='foo'); print cnt
1
>>> res, cnt = a.query_index('baz:1234z', bucket='foo'); print cnt
0

Original comment by aris...@gmail.com on 3 Feb 2009 at 2:19