miku / esbulk

Bulk indexing command line tool for elasticsearch.
GNU General Public License v3.0

indexing failed with 400 Bad Request [resolved, issue with ES cluster] #37

Closed arnaudsj closed 3 years ago

arnaudsj commented 3 years ago

Hi,

I am running into the following error when attempting to index using the following command:

esbulk -index osm_test -server https://vpc-xxx.awszone.es.amazonaws.com -u elastic:password -type _doc -0 -verbose -size 1000 -w 1 -id osm_id osm_data.jsonl
...
2021/02/11 21:52:30 [worker-0] @5227000
2021/02/11 21:52:30 message content-length will be 881947
2021/02/11 21:52:31 [worker-0] @5228000
2021/02/11 21:52:31 message content-length will be 884074
2021/02/11 21:52:31 indexing failed with 400 Bad Request: <html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
</body>
</html>

The data is OpenStreetMap-like data formatted as jsonl. The error occurs after more than 5M documents have been ingested, and I don't see any errors on the server side (Open Distro 1.12 on AWS).

Any ideas on how to see the actual HTTP request that triggers this error? I assume it is a particular document, but the error message does not include a line number.

Thank you in advance!

miku commented 3 years ago

Thanks for the bug report. Offhand, I'd say that since some documents were indexed (it looks like it got up to 5228000 docs), the problem might be that a particular document differs in structure and does not conform to the mapping elasticsearch expects.

I've seen this with real-world data: even if it comes with some "schema", there may be slight problems, e.g. a particular value expressed as an int in 99% of documents and as a str in the remaining 1%, which will confuse the indexer, and ES will return a 400.

I wish I had already implemented a JSON "schema" inference tool that would surface these kinds of problems, but I haven't (yet).


Updates:

$ go get -u -v github.com/miku/jsoninf/cmd/jsoninf
$ jsoninf < file.jsonlines > /dev/null

This command should report type inconsistencies in the input.

arnaudsj commented 3 years ago

@miku thank you for taking a look and for quickly writing jsoninf. I tested it and confirmed that all my jsonl files were in good shape. It took me a few days to track down the original problem, but I can safely say that the issue is not related to esbulk (at least not directly). It turned out that we had upgraded our ES cluster in AWS (Open Distro) to the latest version in our QA environment (ES 7.9), and it appears that there is some instability or bug that has not been fully resolved yet. When I switched to pushing to a different ES cluster, I did not run into the issue at all.

One suggestion I would like to make for esbulk, though, is a mode where the concurrency is slowly raised over time and adapts to the cluster on the other side, because it is almost too easy to overwhelm a cluster on AWS (especially since you can't dedicate ingest nodes yet). Other than that, the tool works like a charm and is fast (it indexes roughly 1M docs/min in my use case).

Thanks again!
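
The adaptive-concurrency suggestion above could be approached with an additive-increase/multiplicative-decrease (AIMD) rule: add one worker after each successful bulk request and halve the worker count when the cluster signals overload. The following Go sketch only illustrates that idea; esbulk does not provide such a mode in this thread, the `limiter` type and its methods are hypothetical, and treating HTTP 429 and 5xx responses as overload signals is an assumption.

```go
package main

import "fmt"

// limiter tracks a target worker count using additive increase and
// multiplicative decrease (AIMD). Hypothetical type, not part of esbulk.
type limiter struct {
	current, min, max int
}

// newLimiter starts at the minimum so that load ramps up slowly.
// It assumes 1 <= min <= max.
func newLimiter(min, max int) *limiter {
	return &limiter{current: min, min: min, max: max}
}

// observe updates the target after one bulk request, given its HTTP status
// code, and returns the new target worker count.
func (l *limiter) observe(status int) int {
	switch {
	case status == 429 || status >= 500:
		// Assumed overload signal: back off quickly by halving.
		l.current /= 2
		if l.current < l.min {
			l.current = l.min
		}
	case status >= 200 && status < 300:
		// Success: probe for more capacity one worker at a time.
		if l.current < l.max {
			l.current++
		}
	}
	// Other statuses (e.g. a 4xx caused by a bad document) leave the target unchanged.
	return l.current
}

func main() {
	l := newLimiter(1, 8)
	for _, status := range []int{200, 200, 200, 429, 200, 503} {
		fmt.Printf("status=%d workers=%d\n", status, l.observe(status))
	}
}
```

A real implementation would also need to start and stop worker goroutines to match the target, and might use request latency as an additional signal. The sketch covers only the control rule.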