Closed: erickmelo closed this issue 9 years ago
I performed a search on ElasticSearch for each object of my domain class to check whether the object was indexed:
def failures = []
def allObjects = classe.list()
allObjects.each {
    elasticSearchHelper.withElasticSearch { client ->
        GetRequest getRequest = new GetRequest(getIndexName(domainClass), getTypeName(domainClass), it.id.toString())
        def result = client.get(getRequest).actionGet()
        if (!result.isExists()) {
            failures << it
        }
    }
}
The result (out of 60,000 objects) was about 9,000 failures. After successive indexing runs the number kept falling. However, if I explicitly indexed the objects that had failed in the batch, the result was 0 failures (i.e., all the objects were indexed correctly). The bug appears to be in the batch operations.
failures.each {
    it.index()
}
There is a subtle bug in the doBulkRequest(Map options, int operationType) method of ElasticSearchService, at least in the version we forked. The following line does not work as expected; records on the chunk boundary are skipped:
int nbRun = Math.ceil(count / maxRes)
The following fixes the issue, although I'm pretty sure there is a much more efficient mathematical solution:
int nbRun = Math.ceil(count * 1.5 / maxRes)
We don't have that many records, so indexing some records more than once is not an issue, and we were very pressed for time at that stage.
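As a side note, the chunk count can be computed exactly with integer arithmetic, which avoids both floating-point rounding and the 1.5x over-indexing workaround. A minimal sketch in Java (ChunkMath and ceilDiv are illustrative names, not part of the plugin; note that if count and maxRes are plain Java ints, count / maxRes truncates before Math.ceil ever runs, which is one way a chunk-boundary record can be dropped):

```java
public class ChunkMath {
    // Exact integer ceiling division: no floating point, no truncation surprises.
    static int ceilDiv(int count, int maxRes) {
        return (count + maxRes - 1) / maxRes;
    }

    public static void main(String[] args) {
        // 188580 records in chunks of 500 need 378 runs, not 377.
        System.out.println(ceilDiv(188580, 500)); // prints 378
        // An exact multiple needs no extra run.
        System.out.println(ceilDiv(1000, 500));   // prints 2
    }
}
```

The idiom `(count + maxRes - 1) / maxRes` is the "much more efficient mathematical solution" hinted at above: it rounds up without ever leaving integer arithmetic.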
@dawiemalan Thanks for clarifying this point.
Is there any prospect of incorporating this fix into a released version of the plugin, even if it is not the best solution? It would be better if we could avoid forking the project.
What this line does is calculate the number of bulk runs needed to index the changes. To avoid exceeding the maximum number of requests in a bulk operation, the changes to be reflected in ES are split into n requests, where n is the smallest integer greater than or equal to the number of operations divided by the maxBulkRequest setting of ES.
Anyway, incorporating this fix seems a bit random and rash: it will make the plugin perform roughly 1.5 times more operations than needed to index/unindex the changes, which may in time cause performance issues.
My point is: I'd use it if I really needed to, but I wouldn't incorporate it in the actual plugin.
I agree, someone needs to look at this in more detail to provide a better fix. This was just a band-aid to get it working; I only shared it to save some time digging through the code to find the issue. I'll see if the guy who did this can spare some time to help.
I've been aware of this bug in the bulk operations for a while but wasn't able to reproduce it consistently; I can also look into it if anyone can contribute an example app that consistently reproduces it.
@noamt I can contribute an example app that reproduces this bug. I will isolate part of our app and share it with you.
@noamt I shared with you an application that reproduces the problem. Every time I ran the indexing (see the integration test) the problem occurred (about 30% of the collection is not indexed).
I have included the dataset used for the tests in the project. The collection has 188,580 objects.
I ran the test and I've managed to reproduce this.
I've added a few more debug log messages to the index and unindex processes; the logs state the number of results found, number of iterations required and an entry for each document fetched from the query result set.
According to the log, all documents that end up missing were removed from the index but never fetched again; this leads me to suspect that it's actually an issue with the DB query and fetch rather than with the number of iterations.
I've deployed a new snapshot version (0.0.3.x) of the plugin to:
http://noams.artifactoryonline.com/noams/grails-elasticsearch-plugin-snapshots
Please try to run the test using the snapshot version, set org.grails.plugins.elasticsearch.ElasticSearchService to log at debug level, and see if you get the same output.
I've tested and got the same output:
2014-10-27 16:08:59,448 [main] DEBUG elasticsearch.ElasticSearchService - Indexing all instances of Artefact > Catmat
2014-10-27 16:08:59,500 [main] DEBUG elasticsearch.ElasticSearchService - Found 188580 instances of Artefact > Catmat
2014-10-27 16:08:59,502 [main] DEBUG elasticsearch.ElasticSearchService - Maximum entries allowed in each bulk request is 500, so indexing is split to 378 iterations
I've looked at the ElasticSearch stats and got the following results:
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"_all" : {
"primaries" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
},
"total" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
}
},
"indices" : {
"br.ufscar.es.test" : {
"primaries" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
},
"total" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
}
}
}
}
This section of JSON caught my attention:
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"index_total" : 188580,
"operations" : 188580,
The total number of operations is correct, but it seems that some indexed objects were deleted. No document removal operations against the index were found in the log file.
I ran these tests from a clean base, without calling the unindex() operation. In theory, no object should have been removed from the index.
Couldn't the fact that ElasticSearch (server-side) records "delete" operations mean that objects are being indexed in duplicate? That is, the number of operations is correct, but perhaps the objects sent to the index are not all distinct (maybe, when paging, the same object is returned on more than one page)?
My hypothesis is confirmed. I added an ordering clause by id and apparently the problem is solved.
def results = scm.domainClass.clazz.withCriteria {
    firstResult(resultToStartFrom)
    maxResults(maxRes)
    order("id", "asc")
}
"docs" : {
"count" : 188580,
"deleted" : 0
},
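For illustration, here is a small, self-contained Java sketch (not plugin code) of why paging without a stable sort order can both duplicate and skip records: if the underlying row order changes between page queries, some rows land on two pages while others land on none. The reshuffle stands in for a DB whose result order is nondeterministic without ORDER BY.

```java
import java.util.*;

public class UnstablePaging {
    // Fetch all "rows" page by page. If ordered is false, the list is
    // reshuffled before each page, mimicking a query with no ORDER BY
    // whose row order is not stable between executions.
    static Set<Integer> pageThrough(List<Integer> rows, int pageSize, boolean ordered, long seed) {
        List<Integer> table = new ArrayList<>(rows);
        Random rnd = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int page = 0; page * pageSize < table.size(); page++) {
            if (ordered) Collections.sort(table);       // stable order: the ORDER BY id fix
            else Collections.shuffle(table, rnd);       // order changes between "queries"
            seen.addAll(table.subList(page * pageSize,
                    Math.min((page + 1) * pageSize, table.size())));
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < 50; id++) ids.add(id);

        // Unstable order: some ids appear on two pages, so others are never fetched.
        System.out.println("without ORDER BY: " + pageThrough(ids, 10, false, 42L).size() + " of 50");
        // Stable order: every id is fetched exactly once.
        System.out.println("with ORDER BY:    " + pageThrough(ids, 10, true, 42L).size() + " of 50");
    }
}
```

The duplicates explain the "deleted" counter seen in the stats above: re-indexing the same id overwrites the existing document, which ES records as a delete, while the ids that never appear on any page are the missing documents.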
That's excellent news, Erick! Would you like to open a pull request? I can also just push the fix myself, if you don't mind.
@noamt I don't mind, you can fix it. Thanks for helping to solve this problem! Congratulations on the excellent work you have done on this plugin.
Cheers! I released a new version with the fix - 0.0.3.5. Thanks for the participation and contribution!
Good job solving this, I already feel safer using the plugin!
Hi,
I tried the fix ordering by id and got the following result: docs: { num_docs: 4064, max_doc: 4064, deleted_docs: 0 }. So I no longer see the deleted-docs issue, but the count is still not correct: a query against the DB shows there should be 4678 docs.
Regards, Rory
I have some large collections of objects (more than 50,000 in each collection) indexed by Elasticsearch (via the Grails plugin). The problem is that some of these objects are not being indexed properly by the plugin (DomainClass.index() or bulkIndexOnStartup). After successive runs of the indexing process these objects do get indexed (or after performing an explicit index on the object: object.index()).
No exception is reported during the process. I performed some tests directly against the Elasticsearch client, searching for objects by ID, and in fact they are not indexed.
Has anyone seen this type of behavior, and how do you handle it?
I believe it would be useful for the plugin to have some consistency-checking mechanism to detect possible failures, even if it requires heavy operations, such as checking, for each object of a given domain class, whether it is present in the ElasticSearch index.
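A hypothetical sketch of such a consistency check, reduced to its core: diff the set of domain-object ids in the database against the set of ids found in the index. In practice the two sets would come from a GORM query and an ES scan or multi-get; here they are stubbed with in-memory data, and all names (ConsistencyCheck, missingFromIndex) are illustrative, not plugin API:

```java
import java.util.*;

public class ConsistencyCheck {
    // Return the ids present in the DB but absent from the index,
    // sorted so the report is easy to read and re-index.
    static Set<Long> missingFromIndex(Set<Long> dbIds, Set<Long> indexedIds) {
        Set<Long> missing = new TreeSet<>(dbIds);
        missing.removeAll(indexedIds);
        return missing;
    }

    public static void main(String[] args) {
        // Stub data: five rows in the DB, three documents in the index.
        Set<Long> db = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> es = new HashSet<>(Arrays.asList(1L, 2L, 4L));
        System.out.println(missingFromIndex(db, es)); // prints [3, 5]
    }
}
```

The resulting set is exactly the list of objects to pass to an explicit index() call, which is what the per-object verification loop at the top of this thread does one GET at a time.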