npgall / cqengine

Ultra-fast SQL-like queries on Java collections
Apache License 2.0
1.72k stars 252 forks source link

Support for Map as collection object? #62

Closed kimptoc closed 8 years ago

kimptoc commented 8 years ago

Any plans to include for support Map as the collection object or would you accept a pull request to add it?

I can currently do something like this:

new SimpleAttribute<Map, String>(attributeName) {
            @Override
            public String getValue(Map o, QueryOptions queryOptions) {
                return (String) o.get(attributeName);
            }
        };

But would good to have something more builtin, eg AttributeFromMap

Cheers, Chris

npgall commented 8 years ago

Hi Chris,

There are a few challenges, but I do think this is a good idea, and this general approach could be supported better.

FYI this has come up a couple of times in the discussion forum [1] [2].

The first challenge is that there is no map-based attribute - as you mentioned! - so people need to use the workaround you suggested.

The second and main challenge though, is that the implementation of equals() and hashCode() in maps and indeed in most collections can be quite expensive. When maps and collections are nested then the outer collections tend to calls equals() and hashCode() quite a lot, which means the more nesting there is the worse the problem becomes. This is the reason why I have not provided a map-based attribute so far :)

So the approach I recommend for now, is the one I mentioned in my second link above, which is to write a POJO class which wraps a map and has an efficient implementation of equals() and hashCode(). For example it could cache the hashCode(), and it could use some special entry in the map which you designate as a primary key, to avoid iterating every map entry when equals() is called. You can then store these map-wrapping-POJO objects in the collection.

At least, this solves the problem where you might want each object to have a dynamic set of fields which can be changed without recompiling, without sacrificing performance.

However the approach is a bit of a kludge.

I've done some experimenting and I think I could add support to QueryFactory to create a mapAttribute() ad-hoc. I will commit it to this branch: https://github.com/npgall/cqengine/tree/issue-62-map-attributes

If you'd like to tackle the other problem - via a pull request to that branch - and write some kind of Map class which wraps another Map but caches the hashcode, and supports an efficient implementation of equals() as above, it would be great!

Thanks, Niall

kimptoc commented 8 years ago

Thanks Niall - will give it a go.

Any hints on which package to put it in - e.g. com.googlecode.cqengine.util, ...map or maybe ...helper?

Any preference on class name - FastMap ?

Will probably start with some tests on the performance of pure map based vs FastMap.

Cheers.

npgall commented 8 years ago

Hi Chris,

How about? -

com.googlecode.cqengine.entity.MapEntity

I will have to include this feature in the documentation so I was trying to think of the correct terminology to use. The best I could think of was to borrow this “entity” terminology from JPA! There are very few synonyms for “Object” which aren’t already overloaded.

I was also thinking we could put a static factory method to instantiate these objects into QueryFactory so maps could be added as follows.

indexedCollection.add(mapEntity(myMap)); Let’s see how it goes. We could decide and rename later.

Thanks! Niall

kimptoc commented 8 years ago

Noted.

Have had a quick go at an Object vs Map performance comparison (https://github.com/kimptoc/cqengine-test). Need to add some warm up time and balance queries vs size more.

First runs had the Map version being 10x slower, but now its showing comparable times... doh

npgall commented 8 years ago

Yes you need some warm up time, as the first run will be slow due to lazy class loading. Also note the ResultSet.size() method can sometimes work by reading statistics from indexes, without doing much work. So you might need to iterate the results to measure the retrieval speed.

The ResultSet.size() method probably doesn't trigger Map.equals() and Map.hashCode() to be called much. However if you add some deduplication into the mix, you should see a bigger difference.

For example when I modify your test to add deduplication with the patch below, I got these results which suggests that the Car POJO is nearly twice as fast as a plain Map:

CQEngine Test - starting
cars created = 278745
elapsed = 1548.5555ms for 100 queries. Time per query:15.4856ms
cars created = 278745
elapsed = 2989.8012ms for 100 queries. Time per query:29.8980ms
CQEngine Test - done

Here's what I changed to add deduplicaton:

Index: src/main/java/Test1.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/main/java/Test1.java    (revision cfd414a936be4bb7d96c5424dfd3d708ae534b90)
+++ src/main/java/Test1.java    (revision )
@@ -3,6 +3,7 @@
 import com.googlecode.cqengine.attribute.SimpleAttribute;
 import com.googlecode.cqengine.index.hash.HashIndex;
 import com.googlecode.cqengine.query.Query;
+import com.googlecode.cqengine.query.option.DeduplicationStrategy;
 import com.googlecode.cqengine.query.option.QueryOptions;
 import com.googlecode.cqengine.resultset.ResultSet;
 import javafx.util.converter.BigDecimalStringConverter;
@@ -14,9 +15,7 @@
 import java.util.function.Function;
 import java.util.function.Supplier;

-import static com.googlecode.cqengine.query.QueryFactory.endsWith;
-import static com.googlecode.cqengine.query.QueryFactory.equal;
-import static com.googlecode.cqengine.query.QueryFactory.or;
+import static com.googlecode.cqengine.query.QueryFactory.*;

 /**
  * Created by kimptoc on 20/05/2016.
@@ -28,6 +27,7 @@
     static String[] MODELS = {"1", "2","3","4"};

     public static void main(String[] args) {
+    for (int i = 0; i < 2; i++) {
         System.out.println("CQEngine Test - starting");
         int secondsForCreation = 1;

@@ -55,6 +55,7 @@
         // test 3 - use MapEntity

         System.out.println("CQEngine Test - done");
+    }

     }

@@ -75,10 +76,10 @@

         long start = System.nanoTime();
         int numFound = -1;
-        int numQueries = cars.size() / 10;
+        int numQueries = 100;
         for (int i = 0; i< numQueries; i++)
         {
-            ResultSet<T> resultSet = cars.retrieve(query1);
+            ResultSet<T> resultSet = cars.retrieve(query1, queryOptions(deduplicate(DeduplicationStrategy.MATERIALIZE)));
             if (numFound == -1)
             {
                 numFound = resultSet.size();
kimptoc commented 8 years ago

Ok - seeing something similar now (updated the test code).

When I get a chance, will do/try the MapEntity next

kimptoc commented 8 years ago

I have added MapEntity and a KeyedMapEntity ( https://github.com/kimptoc/cqengine/tree/issue-62-map-attributes ).

MapEntity just wraps a map, no key based tuning. KeyedMapEntity extends that with using a supplied key to optimise equals.

However, the performance seems to be comparable ( https://github.com/kimptoc/cqengine-test ).

MapEntity seems to have performance only slightly worse than using real objects.

Note - my tests are using 1.8.

Could you have a look at the changes I have made - is that the kind of thing you intended?

Cheers.

npgall commented 8 years ago

Hi Chris,

Yes that's exactly what I had in mind, nice work!

In MapEntity, you could optimize equals() by comparing the cached hash codes before delegating to the equals() methods. Two objects with different hash codes are not allowed to be equal, so if the cached hash codes are different you can just return false. And then you only need to call map.equals() when the hashcodes are the same.

I think we will get 95% of the speed boost just by caching the hashcode, and using it to optimize the equals() method.

Ordinarily with Java collections (including CQEngine), it is not safe to modify objects which have already been added to the collection, because doing so can change their hash codes (which corrupts HashMaps and HashSets etc.). This will definitely apply to MapEntity, because changing any field should change the hash code.

However the KeyedMapEntity will be a useful addition here, because we can say that it will be safe to modify fields as long as the field is not the primary key, and as long as there is no CQEngine index on the field being modified. So can you implement KeyedMapEntity.hashCode() so that its cached hash code is just the hash code of the primary key field? That way, modifying other fields will not change its hash code.

These will be great additions to the library, many thanks! Niall

kimptoc commented 8 years ago

Ok, sounds good - updated my fork and the test repo, a small gain from using the key hashcode & testing hashcodes.

Also added a getMap method to MapEntity to get the original Map - seemed like the simplest option. Works fine for my use case - generally only want one or a few items, and when I need more, am already iterating over the collection - so can be combined with that.

I was thinking to add tests that run something similar to the performance tests I have done, get a timing for pure objects vs pure Maps vs MapEntity and then asserts that the relation between the timings is no worse than some factor, e.g. Map is no worse than 2.5* timings for Car direct and MapEntity is no worse than 1.1*timings for Car direct. Hopefully that should generally pass, respective of the hardware it runs on and helps to ensure we don't lose the performance. What do you think?

npgall commented 8 years ago

That sounds great! Do you want to create a pull request then and we can merge it into CQEngine?

kimptoc commented 8 years ago

I have added a few perf tests to my branch - https://github.com/kimptoc/cqengine/blob/issue-62-map-attributes/code/src/test/java/com/googlecode/cqengine/performance/PerformanceTest.java

Hopefully will flesh out the rest tomorrow and then send you a pull request.

If you get a chance would be good to know if the tests pass on your kit :)

Will try on my laptop later, if I get a chance to see if that passes.

npgall commented 8 years ago

Hi Chris,

I ran those tests, and they all failed on my machine. However I'm not sure what these revamped tests are trying to assert. Previously we already did verify the performance improvement.

So for the pull request, could you strip the number of classes back to a minimum? I don't think we need to incorporate the performance test per-se, because I know that caching the hash code in MapEntity will improve performance. So, we just need the MapEntity and KeyedMapEntity classes, plus regular unit tests for these classes if that's possible.

Thanks! Naill

kimptoc commented 8 years ago

Ok, how does that PR look?

Cheers.

npgall commented 8 years ago

It looks good, thanks! I have merged it. I will look at documentation next, and then merge from this branch into master. I will keep this issue open until then in case we need to discuss further. Thanks for the contribution Chris!

kimptoc commented 8 years ago

Great - thanks Niall. We are using v2.6.0 in prod now with plain Maps (with a 10x gain over our previous solution), looking forward to improving it further with this change.

npgall commented 8 years ago

This is now merged into master. I updated MapEntity to implement the Map interface, so it can be used interchangeably with maps and we can remove some code, but apart from that it works as before. Thanks again!

asuri3 commented 4 years ago

@npgall Do you have any reference to an example where map attribute is present and queried in the class?