miku / esbulk

Bulk indexing command line tool for elasticsearch.
GNU General Public License v3.0

Index failing with `connection reset by peer` #30

Open bnewbold opened 4 years ago

bnewbold commented 4 years ago

I twice attempted to import over 140 million documents into a local, single-node ES 6.8 cluster using a command like the following:

zcat /srv/fatcat/snapshots/release_export_expanded.json.gz |  pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_transform.py elasticsearch-releases - - | esbulk -verbose -size 10000 -id ident -w 6 -index qa_release_v03b -type release

This is with esbulk 0.5.1. I will retry with the latest 0.6.0.

The index almost completed, but after more than 100m documents, failed with an error like:

2020/01/31 11:49:40 Post http://localhost:9200/_bulk: net/http: HTTP/1.x transport connection broken: write tcp [::1]:56970->[::1]:9200: write: connection reset by peer                                                                      
Warning: unable to close filehandle properly: Broken pipe during global destruction

(the "Warning" part might be one of the other pipeline commands)

I suspect this is actually a problem on the Elasticsearch side... maybe something like a GC pause? I looked in the ES logs and saw garbage collections up until the time of failure, and none after, but no particularly large or noticeable GC right around the failure.

I would expect the esbulk HTTP retries to resolve any such issues; I assume in this case all the retries failed. Perhaps longer, more numerous, or exponentially backed-off retries would help. Unfortunately, I suspect that this failure may be difficult to reproduce reliably, as it has only occurred with these very large imports.

esbulk has been really useful, thank you for making it available and any maintenance time you can spare!

bnewbold commented 4 years ago

As a follow-up on this issue: if I recall correctly, the root cause was individual batches that were too large in bytes (not in number of documents), which ES would refuse. I worked around this by decreasing the batch size.