moremore0812 / cqengine

Automatically exported from code.google.com/p/cqengine

Enhancement request: support for case-insensitive search/query #3

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Query an indexed string field with contains, starts-with, equals, etc.
2. The query returns case-sensitive results.

What is the expected output? What do you see instead?
There should be a way to perform case-insensitive comparisons. I currently 
achieve this by adding one more attribute to the POJO that stores a lowercase 
version of the desired attribute, and indexing that instead of the desired 
attribute. I then search with the search term converted to lowercase.

This would be good to have, if it is possible without sacrificing performance.

What version of the product are you using? On what operating system?
0.9.1-all

Please provide any additional information below.

Original issue reported on code.google.com by SylentPr...@gmail.com on 25 Oct 2012 at 3:14

GoogleCodeExporter commented 9 years ago
The approach you took for case-insensitive queries on an indexed string is 
actually the way to get the best performance.

There are actually two ways; the alternative one uses less memory but isn't 
quite as fast in all cases.

The first one, which you described: you add an extra field to the POJO to store 
a lowercase version of the string, and then define an attribute on the 
lowercase version:

    // Option 1: read a lowercase copy stored in the POJO (no conversion per call).
    public static final Attribute<Car, String> NAME = new SimpleAttribute<Car, String>("name") {
        public String getValue(Car car) { return car.nameInLowercase; }
    };

Alternatively, you could define the attribute as a function on the mixed-case 
string:

    // Option 2: lowercase the mixed-case field each time the attribute is read.
    public static final Attribute<Car, String> NAME = new SimpleAttribute<Car, String>("name") {
        public String getValue(Car car) { return car.name.toLowerCase(); }
    };

There are slight differences in performance between the two. The first one will 
be fastest in all cases, but will use more memory (i.e. storing 2 versions of 
the string). The second one will be fast if you build an index on the 
attribute, AND the index gets used to answer your queries. If the index doesn't 
get used for some queries (i.e. it isn't suitable for some query, or CQEngine 
thinks another index will be faster), then once CQEngine has built a candidate 
set from other indexes, it will use this attribute to filter results and will 
end up converting the name to lowercase at runtime. If memory isn't an issue 
then the first option is fastest.
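
To make the trade-off concrete, here is a minimal plain-Java sketch of the 
fallback case, where a candidate set is filtered without an index on the name. 
It does not use CQEngine's API: `CarRecord`, `FallbackFilter`, `STORED` and 
`COMPUTED` are hypothetical names, and the use of `Locale.ROOT` is an 
assumption of this sketch rather than something taken from the snippets above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical POJO mirroring the Car in the snippets above.
class CarRecord {
    final String name;
    final String nameInLowercase; // option 1: lowercased once, costs extra memory

    CarRecord(String name) {
        this.name = name;
        this.nameInLowercase = name.toLowerCase(Locale.ROOT);
    }
}

class FallbackFilter {
    // Option 1: reads the stored lowercase copy; no conversion per call.
    static final Function<CarRecord, String> STORED = car -> car.nameInLowercase;

    // Option 2: lowercases on every call; no extra field is needed.
    static final Function<CarRecord, String> COMPUTED =
            car -> car.name.toLowerCase(Locale.ROOT);

    // Stand-in for filtering a candidate set when no index on the name is used.
    // The caller is expected to pass a search term that is already lowercase.
    static List<CarRecord> filterByName(List<CarRecord> candidates,
                                        Function<CarRecord, String> attribute,
                                        String lowercaseTerm) {
        List<CarRecord> matches = new ArrayList<>();
        for (CarRecord car : candidates) {
            if (attribute.apply(car).equals(lowercaseTerm)) {
                matches.add(car);
            }
        }
        return matches;
    }
}
```

Both getters return the same matches; the difference is that `COMPUTED` 
performs one `toLowerCase` call per candidate on every query, while `STORED` 
pays that cost once per object at construction time.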

It's not realistic for indexes themselves to support case-insensitive queries. 
The letters 'A' and 'a' are represented by different bytes, so navigating 
indexes in a case-insensitive manner would degrade performance. The easiest 
solution is usually to just build the index on either lowercase or uppercase 
versions of strings, and then convert the string in the query to lowercase or 
uppercase accordingly. As you have done :)
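
As an illustration of building the index on lowercase strings and lowercasing 
the query to match, here is a rough sketch of a starts-with lookup. It uses a 
plain `TreeMap` as a stand-in for a sorted index and is not CQEngine's index 
implementation; `LowercasePrefixIndex` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted index; this is not CQEngine's index API.
class LowercasePrefixIndex {
    // Keys are lowercase; values keep the original mixed-case strings.
    private final NavigableMap<String, List<String>> originalsByLowercaseKey =
            new TreeMap<>();

    void add(String value) {
        String key = value.toLowerCase(Locale.ROOT);
        originalsByLowercaseKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<String> startsWith(String prefix) {
        // Normalize the query the same way the keys were normalized.
        String lowercasePrefix = prefix.toLowerCase(Locale.ROOT);
        List<String> results = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so start at the
        // first key >= prefix and stop at the first key that no longer matches.
        for (Map.Entry<String, List<String>> entry
                : originalsByLowercaseKey.tailMap(lowercasePrefix, true).entrySet()) {
            if (!entry.getKey().startsWith(lowercasePrefix)) {
                break;
            }
            results.addAll(entry.getValue());
        }
        return results;
    }
}
```

The index itself stays an ordinary case-sensitive sorted map; all of the 
case handling happens once when a value is added and once per query.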

However, it might be possible to enhance attributes to flag them as being 
case-insensitive. That way, if CQEngine encountered a query on a 
case-insensitive attribute, it could automatically convert the query string to 
lowercase. Then you wouldn't need to think about it when writing queries. I'll 
think about this and probably bundle it in with the changes per the 
null-handling discussion. Thanks!
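
One way such a flag might look, purely as a sketch: an attribute wrapper that 
lowercases both the value it exposes and any query string compared against it. 
`CaseInsensitiveAttribute`, `normalizeQueryValue` and `matchesEqual` are 
hypothetical names invented for this illustration; nothing like this exists in 
CQEngine as of this discussion.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical wrapper, not a CQEngine class: illustrates the idea only.
class CaseInsensitiveAttribute<O> {
    private final Function<O, String> getter;

    CaseInsensitiveAttribute(Function<O, String> getter) {
        this.getter = getter;
    }

    // The value as it would be indexed: always lowercase.
    String getValue(O object) {
        return normalize(getter.apply(object));
    }

    // What the engine could apply to a query string before consulting an index.
    String normalizeQueryValue(String queryValue) {
        return normalize(queryValue);
    }

    // Equality check in which the caller never has to think about case.
    boolean matchesEqual(O object, String queryValue) {
        return getValue(object).equals(normalizeQueryValue(queryValue));
    }

    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

With a wrapper like this, a query written against `"FORD focus"` and an object 
named `"Ford Focus"` would match without either side being lowercased by the 
caller.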

Original comment by ni...@npgall.com on 25 Oct 2012 at 11:36

GoogleCodeExporter commented 9 years ago

Original comment by ni...@npgall.com on 29 Oct 2012 at 9:55

GoogleCodeExporter commented 9 years ago
I'm shelving this idea for the time being.

This would be a useful feature, but I'm not sure about the cost-benefit of 
implementing it. I'm only about 40% in favour and 60% against this idea right 
now though, so if others would like it added, please add an "I want this" 
comment to this issue to vote for it.

It's currently fairly easy to have case-insensitive retrieval, using the 
approach above, so this really is a nice-to-have feature.

Implementing this feature would probably require adding two new types of 
attribute: SimpleCaseInsensitiveAttribute and 
MultiValueCaseInsensitiveAttribute. I'd consider any patches to add the feature, 
and I'm happy to answer questions from anyone who really wants to implement it.

"If in doubt, leave it out" is the motto for the time being. 

Original comment by ni...@npgall.com on 18 Nov 2012 at 10:35