miku / esbulk

Bulk indexing command line tool for elasticsearch.
GNU General Public License v3.0

json_parse_exception #10

Closed neha30 closed 6 years ago

neha30 commented 6 years ago

Hi, I am trying to index some data using esbulk, but it always fails with the error below.

My command is esbulk -server http://10.9.9.8:9200 -index epfo_ind -id name,memberOf.occurrence.identifier.transactionNumber -verbose -type epfo_typ -w 4 -size 1000 2017_09_13_10_20_01_epfo.json

indexing failed with 500 Internal Server Error: {"error":{"root_cause":[{"type":"json_parse_exception","reason":"Unrecognized character escape 'L' (code 76)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7240ed3b; line: 1, column: 72]"}],"type":"json_parse_exception","reason":"Unrecognized character escape 'L' (code 76)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@7240ed3b; line: 1, column: 72]"},"status":500}

I thought there might be a problem with the JSON file format, so I checked by indexing the data one document at a time with curl. That way all the data got indexed, but when I try to index the same data through esbulk it always shows the above error.
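
(For reference, a single document can be indexed with a plain curl call roughly like this; host, index and type are taken from the esbulk command above, and doc1.json is just a placeholder for one document from the file.)

curl -XPOST 'http://10.9.9.8:9200/epfo_ind/epfo_typ' \
    -H 'Content-Type: application/json' \
    --data-binary @doc1.json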

I also checked with another bulk indexing tool, https://github.com/xros/jsonpyes, and with that the data got indexed.

Why does esbulk give this error, and how can I resolve it?

Thanks,

miku commented 6 years ago

Thanks a lot for this bug report. By any chance is there a way you could share an example document that won't get indexed correctly?

miku commented 6 years ago

Just some intermediate debugging results.

I am able to index your example file if I skip the 4th row:

$ esbulk -server http://localhost:9200 -index epfo_ind -id name \
    -verbose -type epfo_typ -w 1 -size 1 <(sed -e '4d' fixtures/issue-10.ldj)
...
8 docs in 53.729153ms at 148.895 docs/s with 1 workers
...

Indexing only the 4th row fails:

$ esbulk -server http://localhost:9200 -index epfo_ind -id name \
    -verbose -type epfo_typ -w 1 -size 1 <(sed -n 4p fixtures/issue-10.ldj)
...
2017/09/25 23:41:46 indexing failed with 500 Internal Server Error:
{
  "error": {
    "root_cause": [
      {
        "type": "json_e_o_f_exception",
        "reason": "Unexpected end-of-input in VALUE_STRING\n at
  [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@695baf66; line: 1, column: 171]"
      }
    ],
    "type": "json_e_o_f_exception",
    "reason": "Unexpected end-of-input in VALUE_STRING\n at
  [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@695baf66; line: 1, column: 171]"
  },
  "status": 500
}
exit status 1

miku commented 6 years ago

Ok, I think I isolated the bug. The 4th line contains a double backslash, which is a correctly escaped backslash. When we read the JSON, we see "name":"R C VENKATESWAR RAO\\" and this escaping is correctly carried over to the content to be indexed. However, in line 169 the ID string is not properly escaped, so the single backslash escapes the following quote even though that quote is the end-of-string marker.
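
For illustration (the name value is taken from the document above; the surrounding bulk action line is schematic, not the exact string esbulk builds), the broken action line looked roughly like this:

{"index": {"_index": "epfo_ind", "_type": "epfo_typ", "_id": "R C VENKATESWAR RAO\"}}

The single backslash escapes the closing quote of the _id value, so the parser never sees the end of the string and runs into the end of input, which matches the "Unexpected end-of-input in VALUE_STRING" error above. With the backslash escaped, the line ends in "_id": "R C VENKATESWAR RAO\\"}} and parses fine.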

I created a new release with the fix: 0.4.5. Could you update and check whether your case is solved? Thanks a lot.

neha30 commented 6 years ago

Hi,

I tried indexing the documents with the new esbulk version, but it is still giving the same error (attached screenshot: esbulk_error).

Ignoring the document works fine:

esbulk -server http://localhost:9200 -index epfo_ind -id name \
    -verbose -type epfo_typ -w 1 -size 1 <(sed -e '4d' fixtures/issue-10.ldj)

Suppose that while indexing I get an error for some document: rather than just ignoring that document, is there any way to save it to a file and continue indexing the other documents?

Thanks

neha30 commented 6 years ago

When ignoring the document using the command below:

esbulk -server http://localhost:9200 -index epfo_ind -id name -verbose -type epfo_typ -w 1 -size 1 <(sed -e '4d' fixtures/issue-10.ldj)

If I don't know the row number at which the error occurs, how can I skip that document?

neha30 commented 6 years ago

Hi Miku,

Thanks for your consideration. The new version resolved my problem. :)

Thanks, Neha

miku commented 6 years ago

The new version resolved my problem.

Glad to hear that.