Closed: erickmelo closed this issue 9 years ago
I performed a search on ElasticSearch for each object of my domain class to check whether the object was indexed:
def failures = []
def allObjects = classe.list()
allObjects.each {
    elasticSearchHelper.withElasticSearch { client ->
        GetRequest getRequest = new GetRequest(getIndexName(domainClass), getTypeName(domainClass), it.id.toString())
        def result = client.get(getRequest).actionGet()
        if (!result.isExists()) {
            failures << it
        }
    }
}
The result (out of 60,000 objects) was about 9,000 failures. After successive indexing runs the number kept falling. However, if I explicitly indexed the objects that had failed in the batch, the result was 0 failures (i.e., all the objects were indexed correctly). The bug appears to be in the batch operations.
failures.each {
    it.index()
}
There is a subtle bug in the doBulkRequest(Map options, int operationType) method of ElasticSearchService, at least in the version we forked. The following line does not work as expected; records on the chunk boundary are skipped:
int nbRun = Math.ceil(count / maxRes)
The following fixes the issue, although I'm pretty sure there is a much more efficient mathematical solution:
int nbRun = Math.ceil(count * 1.5 / maxRes)
We don't have that many records, so indexing some records more than once is not an issue, and we were very pressed for time at that stage.
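As a side note, the chunk count can be computed exactly with integer arithmetic, which avoids both floating-point rounding and the 1.5x over-indexing workaround. A minimal sketch in Java (ChunkMath and ceilDiv are illustrative names, not part of the plugin; note that if count and maxRes are plain Java ints, count / maxRes truncates before Math.ceil ever runs, which is one way a chunk-boundary record can be dropped):

```java
public class ChunkMath {
    // Exact integer ceiling division: no floating point, no truncation surprises.
    static int ceilDiv(int count, int maxRes) {
        return (count + maxRes - 1) / maxRes;
    }

    public static void main(String[] args) {
        // 188580 records in chunks of 500 need 378 runs, not 377.
        System.out.println(ceilDiv(188580, 500)); // prints 378
        // An exact multiple needs no extra run.
        System.out.println(ceilDiv(1000, 500));   // prints 2
    }
}
```

The idiom `(count + maxRes - 1) / maxRes` is the "much more efficient mathematical solution" hinted at above: it rounds up without ever leaving integer arithmetic.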
@dawiemalan Thanks for clarifying this point.
Is there any prospect of incorporating this fix into a released version of the plugin, even if it is not the best solution? It would be better if we could avoid forking the project.
What this line does is calculate the number of bulk runs needed to index the changes. To avoid exceeding the maximum number of requests in a bulk operation, the changes to be reflected in ES are split into n requests, where n is the smallest integer greater than or equal to the number of operations divided by the maxBulkRequest setting of ES.
Anyway, incorporating this fix seems a bit random and rash: it will make the plugin perform roughly 1.5 times more operations than needed to index/unindex the changes, which may in time cause performance issues.
My point is: I'd use it if I really needed to, but I wouldn't incorporate it in the actual plugin.
I agree, someone needs to look at this in more detail to provide a better fix. This was just a band-aid to get it working; I only shared it to save some time digging through the code to find the issue. I'll see if the guy who did this can spare some time to help.
I've been aware of this bug in the bulk operations for a while but wasn't able to reproduce it consistently; I can also look into it if anyone can contribute an example app that consistently reproduces it.
@noamt I can contribute an example app that reproduces this bug. I will isolate part of our app and share it with you.
@noamt I shared with you an application that reproduces the problem. Every time I ran the indexing (see the integration test) the problem occurred (about 30% of the collection is not indexed).
I have included the dataset used for the tests in the project. The collection has 188,580 objects.
I ran the test and I've managed to reproduce this.
I've added a few more debug log messages to the index and unindex processes; the logs state the number of results found, number of iterations required and an entry for each document fetched from the query result set.
According to the log, all documents that end up missing were removed from the index but never fetched again; this leads me to suspect that it's actually an issue with the DB query and fetch rather than with the number of iterations.
I've deployed a new snapshot version (0.0.3.x) of the plugin to:
http://noams.artifactoryonline.com/noams/grails-elasticsearch-plugin-snapshots
Please try to run the test using the snapshot version, set org.grails.plugins.elasticsearch.ElasticSearchService to log at debug level, and see if you get the same output.
I've tested and got the same output:
2014-10-27 16:08:59,448 [main] DEBUG elasticsearch.ElasticSearchService - Indexing all instances of Artefact > Catmat
2014-10-27 16:08:59,500 [main] DEBUG elasticsearch.ElasticSearchService - Found 188580 instances of Artefact > Catmat
2014-10-27 16:08:59,502 [main] DEBUG elasticsearch.ElasticSearchService - Maximum entries allowed in each bulk request is 500, so indexing is split to 378 iterations
I've looked at the ElasticSearch stats and got the following results:
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"_all" : {
"primaries" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
},
"total" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
}
},
"indices" : {
"br.ufscar.es.test" : {
"primaries" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
},
"total" : {
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"store" : {
"size_in_bytes" : 67740299,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 188580,
"index_time_in_millis" : 40291,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0
},
"get" : {
"total" : 92437,
"time_in_millis" : 3027,
"exists_total" : 58443,
"exists_time_in_millis" : 2872,
"missing_total" : 33994,
"missing_time_in_millis" : 155,
"current" : 0
},
"search" : {
"open_contexts" : 0,
"query_total" : 0,
"query_time_in_millis" : 0,
"query_current" : 0,
"fetch_total" : 0,
"fetch_time_in_millis" : 0,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size_in_bytes" : 0,
"total" : 65,
"total_time_in_millis" : 100328,
"total_docs" : 986882,
"total_size_in_bytes" : 555798996
},
"refresh" : {
"total" : 630,
"total_time_in_millis" : 11383
},
"flush" : {
"total" : 0,
"total_time_in_millis" : 0
},
"warmer" : {
"current" : 0,
"total" : 1325,
"total_time_in_millis" : 185
},
"filter_cache" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"id_cache" : {
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"percolate" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 40,
"memory_in_bytes" : 2092264,
"index_writer_memory_in_bytes" : 0,
"version_map_memory_in_bytes" : 0
},
"translog" : {
"operations" : 188580,
"size_in_bytes" : 14270099
},
"suggest" : {
"total" : 0,
"time_in_millis" : 0,
"current" : 0
}
}
}
}
}
This section of JSON caught my attention:
"docs" : {
"count" : 117606,
"deleted" : 7305
},
"index_total" : 188580,
"operations" : 188580,
The total number of operations is correct, but it seems that some indexed objects were deleted. No document removal operations against the index were found in the log file.
I ran these tests from a clean base, without calling the unindex() operation. In theory, no object should have been removed from the index.
Couldn't the fact that ElasticSearch (server-side) records "delete" operations mean that objects are being indexed in duplicate? That is, the number of operations is correct, but perhaps the objects sent to the index are not all distinct (maybe, when paging, the same object is returned on more than one page)?
My hypothesis is confirmed. I added an ordering clause by id and apparently the problem is solved.
def results = scm.domainClass.clazz.withCriteria {
    firstResult(resultToStartFrom)
    maxResults(maxRes)
    order("id", "asc")
}
"docs" : {
"count" : 188580,
"deleted" : 0
},
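For illustration, here is a small, self-contained Java sketch (not plugin code) of why paging without a stable sort order can both duplicate and skip records: if the underlying row order changes between page queries, some rows land on two pages while others land on none. The reshuffle stands in for a DB whose result order is nondeterministic without ORDER BY.

```java
import java.util.*;

public class UnstablePaging {
    // Fetch all "rows" page by page. If ordered is false, the list is
    // reshuffled before each page, mimicking a query with no ORDER BY
    // whose row order is not stable between executions.
    static Set<Integer> pageThrough(List<Integer> rows, int pageSize, boolean ordered, long seed) {
        List<Integer> table = new ArrayList<>(rows);
        Random rnd = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int page = 0; page * pageSize < table.size(); page++) {
            if (ordered) Collections.sort(table);       // stable order: the ORDER BY id fix
            else Collections.shuffle(table, rnd);       // order changes between "queries"
            seen.addAll(table.subList(page * pageSize,
                    Math.min((page + 1) * pageSize, table.size())));
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < 50; id++) ids.add(id);

        // Unstable order: some ids appear on two pages, so others are never fetched.
        System.out.println("without ORDER BY: " + pageThrough(ids, 10, false, 42L).size() + " of 50");
        // Stable order: every id is fetched exactly once.
        System.out.println("with ORDER BY:    " + pageThrough(ids, 10, true, 42L).size() + " of 50");
    }
}
```

The duplicates explain the "deleted" counter seen in the stats above: re-indexing the same id overwrites the existing document, which ES records as a delete, while the ids that never appear on any page are the missing documents.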
That's excellent news, Erick! Would you like to open a pull request? I can also just push the fix myself, if you don't mind.
@noamt I don't mind, you can fix it. Thanks for helping to solve this problem! Congratulations on the excellent work you have done on this plugin.
Cheers! I released a new version with the fix - 0.0.3.5. Thanks for the participation and contribution!
Good job solving this, I already feel safer using the plugin!
Hi,
I tried the fix ordering by id and got the following result: docs: { num_docs: 4064, max_doc: 4064, deleted_docs: 0 }. So I no longer see the deleted-docs issue, but the count is still not correct: a query against the DB shows there should be 4678 docs.
Regards, Rory
I have some large collections of objects (more than 50,000 in each collection) indexed by Elasticsearch (via the Grails plugin). The problem is that some of these objects are not being indexed properly by the plugin (DomainClass.index() or bulkIndexOnStartup). After successive runs of the indexing process these objects do get indexed (or after performing an explicit index on the object: object.index()).
No exception is reported during the process. I performed some tests directly against the Elasticsearch client, searching for objects by ID, and in fact they are not indexed.
Has anyone seen this type of behavior, and how do you handle it?
I believe it would be useful for the plugin to have some consistency-checking mechanism to detect possible failures, even if it requires heavy operations, such as checking, for each object of a given domain class, whether it is present in the ElasticSearch index.
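A hypothetical sketch of such a consistency check, reduced to its core: diff the set of domain-object ids in the database against the set of ids found in the index. In practice the two sets would come from a GORM query and an ES scan or multi-get; here they are stubbed with in-memory data, and all names (ConsistencyCheck, missingFromIndex) are illustrative, not plugin API:

```java
import java.util.*;

public class ConsistencyCheck {
    // Return the ids present in the DB but absent from the index,
    // sorted so the report is easy to read and re-index.
    static Set<Long> missingFromIndex(Set<Long> dbIds, Set<Long> indexedIds) {
        Set<Long> missing = new TreeSet<>(dbIds);
        missing.removeAll(indexedIds);
        return missing;
    }

    public static void main(String[] args) {
        // Stub data: five rows in the DB, three documents in the index.
        Set<Long> db = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> es = new HashSet<>(Arrays.asList(1L, 2L, 4L));
        System.out.println(missingFromIndex(db, es)); // prints [3, 5]
    }
}
```

The resulting set is exactly the list of objects to pass to an explicit index() call, which is what the per-object verification loop at the top of this thread does one GET at a time.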