sirensolutions / siren-join

[This is the old, single-node version for Elasticsearch 2.x; see the latest "Siren Federate" plugin for distributed Elasticsearch 5.x and 6.x capabilities]
http://siren.io
GNU Affero General Public License v3.0

Cache hit/miss based on `range` filter in query is hitting same cache #112

Closed · teebu closed this issue 7 years ago

teebu commented 7 years ago

After performing a query, the first request works as expected, indicated by "cache_hit": false. After changing the range filter numbers, the cache from the previous query is hit, indicated by "cache_hit": true, and the same results as the previous query are returned.

I added a cache-busting filter in the must_not clause, and every time I change it, the query returns the proper results.

Changing { "range": { "meta.relevance": {"gte":0, "lte":0} } } to { "range": { "meta.relevance": {"gte":1, "lte":1} } } gives the same cached result.

Sometimes it takes a few cache-busting queries (changing the must_not) before the stale entry is evicted.

"filterjoin": { // filterjoin can only be used in _coordinate_search endpoint
                  "artistId": { // "foreign key" field in the main doc
                     "indices": [ "developers" ],
                     "types": [ "developerDoc" ],
                     "path": "artistId", // field inside the doc
                     "termsEncoding":"integer",
                     "query": { // query to filter the results down
                        "bool": {
                           "must": [
                                //{ "term": { "artistId": 570060151 } } // by artist id
                                //{ "term": { "meta.relevance": 2 } } // by relevance (cache hit/miss works)
                                { "range": { "meta.relevance": {"gte":1, "lte":1} } } // (cache hit/miss doesnt work - changes here still hits cache)
                           ],
                           "must_not": [
                              { "term": { "meta.relevance": 11 } } // (test by changing this number to reset cache)
                           ]
                        }
                     }
                  }
               }

One thing that is working as a semi-workaround for my problem: replace the range filter with a terms filter:

 { "terms": { "meta.relevance": [0,1] } }
rendel commented 7 years ago

Hi @teebu, thanks for reporting this issue. I'll try to reproduce it on my side in the coming days and keep you informed.

eeeebbbbrrrr commented 7 years ago

Hi! My project (@zombodb) supports siren and I'm seeing similar problems.

I've only spent about 8 minutes on this so far, but it looks to me like FilterJoinNode.cacheId is being calculated incorrectly. If I turn the cacheId into a random number, the problems go away (I assume this is effectively defeating the cache).

I'm going to try and debug this further, but wanted to ask if y'all would accept a pull request against the v1.0 plugin -- right now @zombodb only supports ES 1.7, which means it's stuck using the v1.0 plugin.

eeeebbbbrrrr commented 7 years ago

After looking at this a bit more, I think FilterJoinNode.cacheId should be a String instead of an Integer and it should be the .toString() value of the self argument to the FilterJoinNode constructor.

It doesn't seem hard to formulate different queries that hash to the same hashCode().

Making this change seems to fix the problems I've been seeing.
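
To make that concrete, here is a purely hypothetical sketch of the change (the class shape, field and parameter names are assumptions for illustration, not the actual siren-join source):

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FilterJoinNode, reduced to just the caching concern.
public class FilterJoinNodeSketch {
  private final String cacheId;

  public FilterJoinNodeSketch(Map<String, Object> self) {
    // Use the string form of the parsed query map as the cache identity,
    // rather than an int hash that two different queries can collide on.
    this.cacheId = self.toString();
  }

  public String cacheId() {
    return cacheId;
  }

  public static void main(String[] args) {
    Map<String, Object> query = new HashMap<>();
    query.put("from", 0);
    query.put("to", 0);
    // Prints the map's string form, which is unique per distinct map contents.
    System.out.println(new FilterJoinNodeSketch(query).cacheId());
  }
}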

@rendel, what are your thoughts?

rendel commented 7 years ago

@eeeebbbbrrrr could you give more information on how you hit this issue? In siren-join v1.0 the cache worked differently: it does not have a cache id based on the index version. Therefore, if the index changes between two requests within a short time (shorter than the delay before a cache entry is discarded), you'll hit an issue.

rendel commented 7 years ago

@teebu which version of siren were you using?

teebu commented 7 years ago

I'm using ES 2, with siren 2.4.1.

eeeebbbbrrrr commented 7 years ago

Hi @rendel. Basically the situation is that two different queries end up having the same hashcode, which then causes a cache hit for the first query, rather than a cache miss.

rendel commented 7 years ago

Hi, I have tried to reproduce the issue in a unit test (see branch issue-112 [1]) but without success. Changing the parameters of the range queries is properly taken into account by the cache (at least in my test). If you could try to reproduce the issue in a concise unit test like I did, it would help (just make a patch based on branch issue-112). Thanks.

[1] https://github.com/sirensolutions/siren-join/compare/issue-112?expand=1

teebu commented 7 years ago

You should be looking at the results instead of whether it's a cache hit or miss.

eeeebbbbrrrr commented 7 years ago

Change the testRangeQueries() function to this:

  @Test
  public void testRangeQueries() throws Exception {
    this.loadData();

    // Joining index1.foreign_key with index2.id
    SearchResponse searchResponse = new CoordinateSearchRequestBuilder(client()).setIndices("index1").setQuery(
        boolQuery().filter(
            QueryBuilders.filterJoin("foreign_key").indices("index2").types("type").path("id").query(
                boolQuery().filter(rangeQuery("id").gte(0).lte(0))
            ))
    ).get();
    assertHitCount(searchResponse, 0L);
    assertThat(((CoordinateSearchResponse) searchResponse).getCoordinateSearchMetadata().getActions().get(0).cacheHit(), is(equalTo(false)));

    // Joining index1.foreign_key with index2.id
    searchResponse = new CoordinateSearchRequestBuilder(client()).setIndices("index1").setQuery(
        boolQuery().filter(
            QueryBuilders.filterJoin("foreign_key").indices("index2").types("type").path("id").query(
                boolQuery().filter(rangeQuery("id").gte(1).lte(1))
            ))
    ).get();
    assertHitCount(searchResponse, 0L);
    assertThat(((CoordinateSearchResponse) searchResponse).getCoordinateSearchMetadata().getActions().get(0).cacheHit(), is(equalTo(false)));
  }

It'll fail on the last assertThat() in the method.

Behind the scenes, in FilterJoinNode.java, the queryHash property is calculated to be 1740291740 (on my laptop using Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)) for both queries.

FilterJoinNode.java is essentially trying to use an object's hash code as an identity value for lookup in the cache, and hash codes just can't be used for identity purposes. You either need to use something that's more unique (which is why I previously suggested making the cache id a String and using the .toString() of the param map), or you need to be able to call .equals() when looking up in the cache -- which doesn't seem possible in this case because all you have is a key/id.
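
As a plain-Java illustration (nothing siren-specific), distinct values hashing to the same value is trivially easy to produce, which is exactly why a hash code alone can't serve as a cache identity:

public class HashCodeIdentityDemo {
  public static void main(String[] args) {
    // Two different strings, same hash code (2112): a lookup keyed only on the
    // hash code cannot tell them apart.
    System.out.println("Aa".hashCode());
    System.out.println("BB".hashCode());
    System.out.println("Aa".equals("BB")); // false
  }
}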

eeeebbbbrrrr commented 7 years ago

If you'd prefer to see the diff:

diff --git a/src/test/java/solutions/siren/join/action/admin/cache/FilterJoinCacheActionTest.java b/src/test/java/solut
index d0833d2..758c4ad 100644
--- a/src/test/java/solutions/siren/join/action/admin/cache/FilterJoinCacheActionTest.java
+++ b/src/test/java/solutions/siren/join/action/admin/cache/FilterJoinCacheActionTest.java
@@ -102,18 +102,17 @@ public class FilterJoinCacheActionTest extends SirenJoinTestCase {
     SearchResponse searchResponse = new CoordinateSearchRequestBuilder(client()).setIndices("index1").setQuery(
         boolQuery().filter(
             QueryBuilders.filterJoin("foreign_key").indices("index2").types("type").path("id").query(
-                boolQuery().filter(rangeQuery("id").gte("0").lte("1"))
+                boolQuery().filter(rangeQuery("id").gte(0).lte(0))
             ))
     ).get();
-    assertHitCount(searchResponse, 2L);
-    assertSearchHits(searchResponse, "1", "4");
+    assertHitCount(searchResponse, 0L);
     assertThat(((CoordinateSearchResponse) searchResponse).getCoordinateSearchMetadata().getActions().get(0).cacheHit(

     // Joining index1.foreign_key with index2.id
     searchResponse = new CoordinateSearchRequestBuilder(client()).setIndices("index1").setQuery(
         boolQuery().filter(
             QueryBuilders.filterJoin("foreign_key").indices("index2").types("type").path("id").query(
-                boolQuery().filter(rangeQuery("id").gte("0").lte("0"))
+                boolQuery().filter(rangeQuery("id").gte(1).lte(1))
             ))
     ).get();
     assertHitCount(searchResponse, 0L);
rendel commented 7 years ago

@eeeebbbbrrrr thanks, I was able to reproduce the issue. The cause is a hash collision due to the way HashMap.hashCode is computed (by summing the hash code of each entry, where each entry's hash code is computed as an exclusive OR between the hash codes of its key and value).

Objects.hashCode("from") ^ Objects.hashCode(0) = 3151786 ^ 0 = 3151786
Objects.hashCode("to") ^ Objects.hashCode(0) = 3707 ^ 0 = 3707
3151786 + 3707 = 3155493

Objects.hashCode("from") ^ Objects.hashCode(1) = 3151786 ^ 1 = 3151787
Objects.hashCode("to") ^ Objects.hashCode(1) = 3707 ^ 1 = 3706
3151787 + 3706 = 3155493
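
The collision can be reproduced with plain java.util.HashMap, independently of siren-join (a minimal standalone sketch):

import java.util.HashMap;
import java.util.Map;

public class MapHashCollisionDemo {
  public static void main(String[] args) {
    Map<String, Integer> range0 = new HashMap<>();
    range0.put("from", 0);
    range0.put("to", 0);

    Map<String, Integer> range1 = new HashMap<>();
    range1.put("from", 1);
    range1.put("to", 1);

    // Different contents, identical hash code (3155493 in both cases),
    // because the per-entry XORs cancel out when summed.
    System.out.println(range0.hashCode());
    System.out.println(range1.hashCode());
    System.out.println(range0.equals(range1)); // false
  }
}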

We should be able to fix the issue by implementing a hash function that uses a bit mixer [1] to improve the distribution of hash values, as is done in the hppc library [2]. I'll work on a patch.

[1] http://zimbry.blogspot.fr/2011/09/better-bit-mixing-improving-on.html
[2] https://github.com/carrotsearch/hppc/blob/0.7.1/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeHashMap.java#L615
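
For reference, a rough sketch of the idea (names and structure are illustrative only, not the actual patch): run the key and value hashes through a MurmurHash3-style finalizer before combining them, so small input differences no longer cancel each other out systematically:

import java.util.HashMap;
import java.util.Map;

public class MixedMapHash {
  // MurmurHash3 fmix32 finalizer, the kind of bit mixer referenced above.
  static int mix32(int k) {
    k = (k ^ (k >>> 16)) * 0x85ebca6b;
    k = (k ^ (k >>> 13)) * 0xc2b2ae35;
    return k ^ (k >>> 16);
  }

  // Order-independent map hash that mixes each key and value hash before combining them.
  static int hash(Map<?, ?> map) {
    int h = 0;
    for (Map.Entry<?, ?> e : map.entrySet()) {
      h += mix32(e.getKey().hashCode()) ^ mix32(e.getValue().hashCode());
    }
    return h;
  }

  public static void main(String[] args) {
    Map<String, Integer> range0 = new HashMap<>();
    range0.put("from", 0);
    range0.put("to", 0);
    Map<String, Integer> range1 = new HashMap<>();
    range1.put("from", 1);
    range1.put("to", 1);
    // With the mixer, the per-entry differences no longer cancel out systematically.
    System.out.println(hash(range0));
    System.out.println(hash(range1));
  }
}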

rendel commented 7 years ago

Patch provided in pull request #117.

teebu commented 7 years ago

Now that the issue is closed, will I be able to just install it like this, or do I have to figure out how to compile for my version?

bin/plugin install solutions.siren/siren-join/2.4.1
eeeebbbbrrrr commented 7 years ago

@rendel, I don't believe the problem is how the hash code is being generated; the problem is that a hash code is being used for identity/equality, which is, in my opinion, inherently broken.

Changing how the hashcode is calculated only changes the computed values, it doesn't eliminate the possibility that two different queries will hash to the same value.

I really don't think this issue is resolved, and if anything, PR #117 just makes the problem harder to detect.

rendel commented 7 years ago

@eeeebbbbrrrr one of the issues with using a JSON string serialisation as the cache id is that the cache id becomes highly sensitive to any change in the JSON query: order of attributes, blank characters, etc. To avoid this, the JSON query is first converted into a map representation, and then a hash of the map is computed. The hash computation is made so that it does not depend on the order of the map entries.

There is also another, historical requirement for computing a hash of the query: the Lucene query cache. The key of the Lucene query cache is based on the Query object. However, in our case the Query object can contain a very large byte array (containing the join values), which is not something we want to use as a cache key. Therefore, this JSON query cache id is used instead of the byte array.

It is true that whenever we use a hashing mechanism we are subject to collisions, which are themselves subject to a probability. If the odds of a hash collision in a particular scenario are very small, then the benefits outweigh the disadvantage. Here the probability was skewed by the way the hash was computed, which was solved by using a bit mixer to improve the hash distribution. We could further reduce the odds of collision by using a 64-bit hash instead of a 32-bit one. If one wants to remove the odds of collisions completely, the proper way would be to use the Map object as the cache key (instead of a string), but we would have to refactor different parts of the code. That would be the subject of another issue, not this one. Hope this clarifies the current decision.
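
To illustrate the string-versus-map sensitivity point, here is a standalone sketch (it uses Jackson purely for illustration, which is an assumption and not how siren-join itself parses queries):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class MapVsStringCacheKey {
  public static void main(String[] args) throws Exception {
    // Two logically identical queries that differ only in attribute order and whitespace.
    String q1 = "{\"range\":{\"id\":{\"gte\":0,\"lte\":0}}}";
    String q2 = "{ \"range\": { \"id\": { \"lte\": 0, \"gte\": 0 } } }";

    System.out.println(q1.equals(q2)); // false: a string-based cache id would differ

    ObjectMapper mapper = new ObjectMapper();
    Map<?, ?> m1 = mapper.readValue(q1, Map.class);
    Map<?, ?> m2 = mapper.readValue(q2, Map.class);
    System.out.println(m1.equals(m2));                  // true: the map form ignores order and whitespace
    System.out.println(m1.hashCode() == m2.hashCode()); // true: equal maps have equal hash codes
  }
}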

rendel commented 7 years ago

@teebu this will be included in the next release 2.4.4, but at the moment if you want to use it you'll have to build the siren-join plugin yourself.