miku / esbulk

Bulk indexing command line tool for elasticsearch.
GNU General Public License v3.0

Losing documents #5

Closed albertpastrana closed 7 years ago

albertpastrana commented 7 years ago

Hi again,

We've been doing some tests using esbulk and it works really well in most cases, but we've found some issues when loading millions of documents.

The scenario is the following: we have a line-delimited JSON file with 50 million documents (we've tested with 20M too and get the same results), and each document looks something like this:

For the purposes of this test, we have a cluster with 8 machines, each of them with SSD disks and 16 cores.

From one of the nodes of the cluster we run the following:

  1. Create the index
$ curl -XPUT "http://localhost:9200/index_name" -d'
{
  "settings": {
    "number_of_shards": 256,
    "number_of_replicas": 0
  },
  "mappings": {
     "default": {
        "properties": {
           "p1": {
              "type": "short"
           },
           "p2": {
              "type": "short"
           },
...
}
  2. Load the data
$ esbulk -server http://localhost:9200 -size 5000 -w 16 -index index_name -verbose /data-file

We've done the tests using different configurations (size of 1000, 5000, 10000, 20000 and number of workers 1, 4, 8, 16) and we always get similar results.

The thing is that we get a message like

50000000 docs in Xs at Y docs/s with Z workers

But when we check the size of the index (either by querying GET /index_name/_search or GET /_cat/indices?v) we get a smaller number of documents. Sometimes the difference is only a few hundred documents and sometimes several thousand; it looks like the higher the number of workers, the more documents are lost.
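
For reference, a quick way to compare the expected and indexed counts (a minimal sketch, using the data file and index name from the steps above):

$ wc -l /data-file
$ curl -s 'http://localhost:9200/index_name/_count'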

We've also checked the logs of all the nodes in the cluster and haven't seen anything unusual; actually, the only messages are the following:

updating refresh_interval from [1s] to [-1]
updating refresh_interval from [-1] to [1s]

These messages would indicate a successful load (if we, for example, load an incorrect document, we can see the error in the logs).

It's very strange behavior, and we wonder why we don't see any log entry, either in ES or in esbulk, indicating a problem with some documents.

Any help will be much appreciated, thanks in advance!

ps1: we've tested with smaller loads (1M docs) and it works well; it looks like this only happens with big loads.
ps2: if you need a sample file we can provide it, although it's pretty big (>10G compressed), or we could provide the code that we've used to generate the data.

miku commented 7 years ago

Thanks for the detailed report. This is very unfortunate and I'd like to fix it quickly. It worries me that there is no hint in the elasticsearch logs. We had one similar case, but it turned out that some documents contained duplicate IDs. As the IDs are automatically generated in your example (elasticsearch creates UUIDs for them), we can eliminate that possibility.

ps2: if you need a sample file we can provide it although it's pretty big (>10G compressed) or we could provide the code that we've used to generate the data

Either way would be great, if it's not a hassle.

albertpastrana commented 7 years ago

We had an id and we tried to use the -id option as well, but it didn't work either.

To generate the data you could use dummy-json and the following template:

{"pid":"{{guid}}","arr1":[{{int 1 30000}}{{#repeat 49 comma=false}},{{int 1 30000}}{{/repeat}}],"arr2":[{{int 1 30000}}{{#repeat 49 comma=false}},{{int 1 30000}}{{/repeat}}],"arr3":[{{int 1 250}}{{#repeat 14 comma=false}},{{int 1 250}}{{/repeat}}],"arr4":[{{int 1 400}}{{#repeat 4 comma=false}},{{int 1 400}}{{/repeat}}],"arr5":[{{int 1 400}}{{#repeat 4 comma=false}},{{int 1 400}}{{/repeat}}],"geo":"{{lat}},{{long}}","attr1":{{int 1 50}},"attr2":{{int 1 50}},"attr3":{{int 1 50}},"attr4":{{int 1 50}},"attr5":{{int 1 50}},"attr6":{{int 1 50}},"attr7":{{int 1 50}},"attr8":{{int 1 50}},"attr9":{{int 1 50}},"attr10":{{int 1 50}},"attr11":{{int 1 50}},"attr12":{{int 1 50}},"attr13":{{int 1 50}},"attr14":{{int 1 50}},"attr15":{{int 1 50}},"attr16":{{int 1 50}},"attr17":{{int 1 50}},"attr18":{{int 1 50}},"attr19":{{int 1 50}},"attr20":{{int 1 50}},"attr21":{{int 1 50}},"attr22":{{int 1 50}},"attr23":{{int 1 50}},"attr24":{{int 1 50}},"attr25":{{int 1 50}},"attr26":{{int 1 50}},"attr27":{{int 1 50}},"attr28":{{int 1 50}},"attr29":{{int 1 50}},"attr30":{{int 1 50}},"attr31":{{int 1 50}},"attr32":{{int 1 50}},"attr33":{{int 1 50}},"attr34":{{int 1 50}},"attr35":{{int 1 50}},"attr36":{{int 1 50}},"attr37":{{int 1 50}},"attr38":{{int 1 50}},"attr39":{{int 1 50}},"attr40":{{int 1 50}} }

Given that dummy-json keeps the whole output in memory, you can use the following shell script to run the template repeatedly and generate as many rows as you want:

#!/bin/bash
# Usage: ./run.sh N -- renders the template N times, one JSON document per run
c=$1
while [ "$c" -gt 0 ]
do
    dummyjson template.hbs
    (( c-- ))
done

Steps to generate the file:

  1. Store the template in a file template.hbs
  2. Create a script run.sh in the same folder using the contents above
  3. Change permissions: $ chmod u+x run.sh
  4. Run it: $ ./run.sh 1000 | gzip > data.ldj.gz

miku commented 7 years ago

Thanks for the detailed explanation. Will have time to investigate this in depth by the end of the week - sorry, and thanks for your patience.

albertpastrana commented 7 years ago

@miku no problem and thanks for taking the time :)

miku commented 7 years ago

Note to self: A first test with 1M mocked JSON docs on a single ES 2.4.1 node seems fine.

$ time esbulk -verbose -index throwaway -z 1M.ldj.gz
...
2016/10/04 13:30:41 1000000 docs in 38m10.21515082s at 436.640 docs/s with 4 workers
2016/10/04 13:30:41 set index.refresh_interval to 1s: 200 OK
2016/10/04 13:30:56 index flushed: 200 OK
$ curl -s localhost:9200/throwaway/_count | jq .
{
  "count": 1000000,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  }
}

karanjeets commented 7 years ago

@miku Facing the same issue here. Losing 50% of documents

Logs

2016/11/25 19:21:40 [worker-28] @2000
2016/11/25 19:21:40 [worker-28] @2000
2016/11/25 19:21:54 [worker-26] @1806
2016/11/25 19:21:54 83280 docs in 2m34.727564466s at 538.236 docs/s with 48 workers
2016/11/25 19:21:54 set index.refresh_interval to 1s: 200 OK

2016/11/25 19:22:27 index flushed: 200 OK
curl -XGET 'http://localhost:9200/fall-2016/_count?pretty'
{
  "count" : 45868,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

The count goes up to 82,945 when I use 4 workers. There seems to be some problem with parallel indexing. Anything specific I need to configure in Elasticsearch for parallel indexing?

Okay, I investigated further. Out of 83,280 documents, there are only 82,945 with unique IDs. Therefore, it seems to work well with 4 workers.
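
A quick way to check for duplicate IDs in the input (a sketch; data.ldj is a placeholder for the line-delimited JSON file and the id field name may need adjusting):

$ jq -r .id data.ldj | sort -u | wc -l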

miku commented 7 years ago

@karanjeets, thanks for the bug report. Did you use a cluster (if so, how many nodes) or a single elasticsearch server?

miku commented 7 years ago

@albertpastrana, @karanjeets I believe I found the problem and the fix. May I kindly ask you to upgrade to 0.4.0? If the problem persists, I'd be glad if you let me know. Thanks a lot.

karanjeets commented 7 years ago

@miku, I used a single elasticsearch server. I will upgrade and give it a try in the next week. Thanks for fixing it.

karanjeets commented 7 years ago

@miku, Okay I tried to index it with 48 workers. It is giving me the same result as before. Can you please help?

miku commented 7 years ago

@karanjeets, thanks for testing. Indeed, I am able to reproduce the issue. An Elasticsearch bulk request will return HTTP 200 even if there were some problems during indexing:

2016/11/28 13:00:04 HTTP 200 on http://localhost:9200/issue5/default/_bulk
2016/11/28 13:00:04  {
    "took":1,
    "errors":true,
    "items":[
        {
            "index": {
                "_index": "issue5",
                "_type": "default",
                "_id":"AViqzmvFcb59SOxC0rQf",
                "status":429,
                "error":{
                    "type":"es_rejected_execution_exception",
                    "reason":"rejected execution of org.elasticsearch. ..."
                    ...
                }
            ...
        }
    ]
}

With a lower number of workers, elasticsearch won't exhaust its bulk thread pool, so the problem does not occur.
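
The errors flag can be inspected from the shell as well; a minimal sketch (batch.ldj is a hypothetical file in bulk format, and jq pulls the flag out of the response body, which is true if any item failed):

$ curl -s -XPOST 'http://localhost:9200/issue5/default/_bulk' --data-binary @batch.ldj | jq .errors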

It is not possible to increase the bulk thread pool size above the number of available processors + 1, but it is possible to increase the thread_pool.bulk.queue_size value.

Examples:

$ wc -l 1M.ldj
1000000

$ esbulk -w 48 -verbose 1M.ldj # fine with thread_pool.bulk.queue_size: 100000
...
1000000 docs in 7m14.426745922s at 2301.884 docs/s with 48 workers

$ esbulk -w 4 -verbose 1M.ldj # elasticsearch defaults
...
1000000 docs in 7m53.492212973s at 2111.967 docs/s with 4 workers
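
For reference, the queue size is a node-level setting; a sketch of the relevant line in elasticsearch.yml (the exact key depends on the elasticsearch version, e.g. threadpool.bulk.queue_size on 2.x, and the value here is only illustrative):

thread_pool.bulk.queue_size: 100000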

I will fix esbulk to slow down and retry the bulk request if any error happened. For single server setups, the defaults should be safe, and increasing the number of workers won't result in faster indexing.

miku commented 7 years ago

@albertpastrana, @karanjeets I released another version, esbulk 0.4.1, which fixes the problem, at least to a certain extent (cf. release notes and b16722f3). It should now be impossible to silently lose documents, as esbulk halts at the first bulk error it encounters (these errors arrive inside an HTTP 200 response and therefore went undetected until now).

I have successfully tested this on a local, single node setup. As always, feedback welcome. Thanks!

miku commented 7 years ago

We could think about a transactional bracket around the indexing process, some kind of rollback, or an in-flight adjustment of the number of workers. Since esbulk is usable without these, I'll close this issue for now.

I updated the README with a detailed caveat about this issue.