miku / esbulk

Bulk indexing command line tool for elasticsearch.
GNU General Public License v3.0

feature request: routing by _id #34

Open bnewbold opened 4 years ago

bnewbold commented 4 years ago

This Elasticsearch blog post implies that bulk indexing is faster when all documents in a batch go to the same shard: https://www.elastic.co/blog/how-kenna-security-speeds-up-elasticsearch-indexing-at-scale-part-1

The feature request for esbulk would be to somehow automate this speed-up, without users needing to re-sort or partition documents themselves. Some unstructured thoughts about this:

miku commented 3 years ago

Great point.

After reading https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html#_making_a_routing_value_required I think a custom routing value would not simplify things: it would have to be used at both index and query time, etc.

The way I see it, this could be done with a per-shard cache (option 4), kept in memory (or even in temp files, if there are many shards).
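
A minimal sketch of what such an in-memory per-shard cache might look like in Go. Everything here is hypothetical and not part of esbulk: the `Doc`, `ShardCache`, and `ShardFunc` names are made up for illustration, and `fnvShard` is only a placeholder hash. A real implementation would need to reproduce Elasticsearch's own routing function for `_id` (and know the index's shard count), otherwise the batches would not line up with the actual shards.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Doc is a hypothetical minimal document: an ID plus the raw JSON body.
type Doc struct {
	ID   string
	Body string
}

// ShardFunc maps a document ID to a shard number in [0, numShards).
type ShardFunc func(id string, numShards int) int

// fnvShard is a PLACEHOLDER hash (FNV-1a), not Elasticsearch's routing
// function. It only demonstrates the grouping logic.
func fnvShard(id string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % uint32(numShards))
}

// ShardCache buffers documents per shard and hands a shard's buffer to
// flush as a single batch once it holds batchSize documents.
type ShardCache struct {
	numShards int
	batchSize int
	shardOf   ShardFunc
	buffers   [][]Doc
	flush     func(shard int, batch []Doc)
}

// NewShardCache creates a cache with one empty buffer per shard.
func NewShardCache(numShards, batchSize int, shardOf ShardFunc, flush func(int, []Doc)) *ShardCache {
	return &ShardCache{
		numShards: numShards,
		batchSize: batchSize,
		shardOf:   shardOf,
		buffers:   make([][]Doc, numShards),
		flush:     flush,
	}
}

// Add appends doc to the buffer of its shard and flushes that buffer
// when it is full.
func (c *ShardCache) Add(doc Doc) {
	s := c.shardOf(doc.ID, c.numShards)
	c.buffers[s] = append(c.buffers[s], doc)
	if len(c.buffers[s]) >= c.batchSize {
		c.flush(s, c.buffers[s])
		c.buffers[s] = nil
	}
}

// Close flushes any remaining partial batches.
func (c *ShardCache) Close() {
	for s, buf := range c.buffers {
		if len(buf) > 0 {
			c.flush(s, buf)
			c.buffers[s] = nil
		}
	}
}

func main() {
	// A real tool would send one bulk request per batch here.
	flush := func(shard int, batch []Doc) {
		fmt.Printf("shard %d: %d docs\n", shard, len(batch))
	}
	cache := NewShardCache(3, 2, fnvShard, flush)
	for i := 0; i < 10; i++ {
		cache.Add(Doc{ID: fmt.Sprintf("doc-%d", i), Body: "{}"})
	}
	cache.Close()
}
```

Memory use is bounded by roughly `numShards * batchSize` buffered documents, which is why spilling to temp files might only become necessary with many shards or large batches.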