taraslayshchuk / es2csv

Export from an Elasticsearch into a CSV file
Apache License 2.0
510 stars, 191 forks

expired(multiple reads?) #10

Closed: WormsCH closed this issue 7 years ago

WormsCH commented 8 years ago

Hi,

When I try to run queries spanning more than 7 days, the script just freezes and I have to kill it. (es2csv was installed through pip install.)

I then installed es2csv from the sources on GitHub, and now I get this error instead: "expired(multiple reads?)"

Here is the output: "Scroll[c2NhbjswOzE7dG90YWxfaGl0czoxMzg4Nzg7] expired(multiple reads?). Saving loaded data.############################## ] [115189/138878] [ 82%] [0:00:10] [ETA: 0:00:02] [ 11.28 kdocs/s]"

And obviously not all data are in the CSV.

When I run the same query in Kibana, I can retrieve all the data. Where does the issue come from?

Did Elasticsearch time out during the query? How can I solve this issue?

Regards, Cédric

taraslayshchuk commented 7 years ago

Hi,

I added this check to prevent the multiple-reads exception and the script freeze (#9).

Elasticsearch has a 30m timeout per scroll page and a 120 s timeout per HTTP request. Please provide more information about your Elasticsearch setup (version, architecture, index settings, index mapping) and about the es2csv arguments you used. If you are losing data, it could be a hardware issue. The Elasticsearch logs from the scroll process would help pin down the details.

taraslayshchuk commented 7 years ago

@WormsCH, @conradlee can you still reproduce this issue?

conradlee commented 7 years ago

Yes, I encountered it again, even with your patch.

taraslayshchuk commented 7 years ago

@conradlee Please provide more information about your Elasticsearch setup (version, architecture, index settings, index mapping), plus the es2csv arguments and version, your Python and pip versions, and your OS version.

conradlee commented 7 years ago

Sorry, I'm on the road now, but I'll try to replicate this problem and document all those details when I'm done traveling next week.

conradlee commented 7 years ago

Ok, I can provide you with some information:

I have a theory about what's causing the infinite loop. The query I'm running selects all documents with a saved date less than some specified cutoff. It's a big query though, so it takes around 12 hours for es2csv to scroll through all the results and save them. In the meantime, some of the documents in the original result set have been re-saved, removing them from the result set.

Depending on how the scrolling is implemented, this could mean that the final result set is smaller than the original result set, which means that the while loop never exits.
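To illustrate the theory, here is a minimal sketch of a scroll loop that stops on the first empty page instead of waiting for a running count to reach the total reported up front. The thread does not show es2csv's actual loop, so this is an assumption about the failure mode, not the tool's code. `export_hits` and `FakeClient` are hypothetical names; the fake only mimics the `search`/`scroll` call shape of elasticsearch-py and deliberately delivers fewer hits than it reports.

```python
def export_hits(client, index, query, scroll="30m"):
    """Scroll through every hit, stopping on the first empty page.

    Assumes a regular (non-scan) scroll, where the first response
    already carries hits, and an integer hits.total.
    """
    page = client.search(index=index, body=query, scroll=scroll)
    total = page["hits"]["total"]  # count reported when the scroll was opened
    rows = []
    while page["hits"]["hits"]:  # an empty page ends the loop
        rows.extend(page["hits"]["hits"])
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=scroll)
    return rows, total


class FakeClient:
    """Hypothetical stand-in: reports 5 hits but only ever delivers 3."""

    def __init__(self):
        self._pages = [[{"_id": 1}, {"_id": 2}], [{"_id": 3}], []]

    def _next(self):
        hits = self._pages.pop(0) if self._pages else []
        return {"_scroll_id": "fake", "hits": {"total": 5, "hits": hits}}

    def search(self, index, body, scroll):
        return self._next()

    def scroll(self, scroll_id, scroll):
        return self._next()


rows, total = export_hits(FakeClient(), "logs", {"query": {"match_all": {}}})
print(len(rows), "of", total)
```

A loop written as `while len(rows) < total` would keep requesting pages forever with this fake client, which matches the freeze described above. Whether Elasticsearch really shrinks a scroll's result set when documents are re-saved is not established in this thread.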

taraslayshchuk commented 7 years ago

es2csv uses the scroll API under the hood, or more precisely the scroll API as exposed by elasticsearch-py. I have never tested it on indexes that are being modified, and I cannot find any documentation on how it behaves in that case. My advice is to copy your index (so that the copy is effectively read-only) and run your query against the copy. Logs from ES could help too.

taraslayshchuk commented 7 years ago

@conradlee Oh, it looks like I found the root cause (source):

For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library.

For Elasticsearch 1.0 and later, use the major version 1 (1.x.y) of the library.

So the issue may be that es2csv uses elasticsearch-py version 2.4.0 while you are running Elasticsearch version 1.7.
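A quick way to check for this kind of mismatch is to compare the major version of the client library with the major version of the server. The sketch below is only illustrative: `majors_match` is a hypothetical helper, and the version values are passed in by hand. In a live setup they would presumably come from `elasticsearch.__version__` (a tuple) and from the `version.number` field returned by the client's `info()` call.

```python
def majors_match(client_version, server_version):
    """Return True when client library and server share a major version.

    client_version: tuple such as (2, 4, 0)
    server_version: string such as "1.7" or "1.7.3"
    """
    server_major = int(server_version.split(".")[0])
    return client_version[0] == server_major


# The combination described in this thread: client 2.4.0 against server 1.7.
print(majors_match((2, 4, 0), "1.7"))
# A 1.x client against the same server.
print(majors_match((1, 9, 0), "1.7"))
```

The first call reports a mismatch and the second a match, which is the rule quoted from the library's documentation above.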