Closed amitpawar closed 8 years ago
Hello
I can't really say what's going on without any information from you. Please provide the stack trace from the running script.
I can only guess: if the script just stops writing to the file, there is a problem with the connection, and it should retry after a 2-minute timeout. Did you try to wait?
Could you also provide the query and arguments you used to run the script, the version of the script, and an example of the data you have in your Elasticsearch?
Hello
Yes, I waited for around 1-2 hours before manually stopping the script.
Here is the call:
`python es2csv.py -u server_address -i foxstream_offers_production_active -f id product_id store_id retailer_id price availability_status product_gtin indexed_at activation_status quantity -q 'retailer_id:50a7fe9fe7ac4a8c8df86fb0189caa66' -o feeds\\offers_test.csv --debug`
The data in ES is offer information for our website marketplace. The index has around 120 million records, which we break into 3 fetches by retailer and run as scheduled Jenkins jobs.
The progress output just hangs at around 95%, with no change in the count or ETA for 1-2 hours:

`Run query [] [37762125/39583209] [ 95%] [2:45:48] [ETA: 0:07:59] [ 3.80 kdocs/s]`
Please let me know your comments.
Thanks Amit
Hello, Amit.
I tried to reproduce your issue and found only one scenario. Did you try to open, in a browser or with curl, the link
http://localhost:9200/_search/scroll?scroll_id=%scroll_id_from_debug_output%
If so, you break the page order, and Elasticsearch keeps returning scroll_id == c2NhbjswOzE7dG90YWxfaGl0czoxOTIwMjQwOw==, which is four times shorter than a regular one and comes with zero elements in the hits array. There is no check to terminate such requests, so the while loop runs forever with no chance to stop. You can read more about scroll search in the Elasticsearch documentation.
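The failure mode described above can be sketched with stub response dicts (the field layout follows elasticsearch-py's result format; the ids and totals are made up): a broken scroll context still returns HTTP 200 and a scroll_id, but its hits array is empty, and that empty array is the condition a terminating loop has to check.

```python
# Hypothetical scroll responses, modeled on elasticsearch-py result dicts.
normal_page = {
    '_scroll_id': 'DXF1ZXJ5QW5kRmV0Y2hfLi4u',  # long, regular scroll_id (made up)
    'hits': {'total': 1920240, 'hits': [{'_id': '1'}, {'_id': '2'}]},
}

# A broken scroll context: short scroll_id, zero hits, but still a 200 response.
broken_page = {
    '_scroll_id': 'c2NhbjswOzE7dG90YWxfaGl0czoxOTIwMjQwOw==',
    'hits': {'total': 1920240, 'hits': []},
}

def scroll_exhausted(res):
    """A scroll is finished (or broken) when the hits array comes back empty."""
    return len(res['hits']['hits']) == 0

print(scroll_exhausted(normal_page))   # False
print(scroll_exhausted(broken_page))   # True
```

Without this check, code that only compares a running counter against `hits.total` never notices the broken context and keeps re-requesting the same dead scroll.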
Hello
You are right. I added an extra check to the while loop:

```python
hits_check = res['hits']['total']
while total_lines != self.num_results and hits_check > 0:
    res = self.es_conn.scroll(scroll=self.scroll_time, scroll_id=res['_scroll_id'])
    hits_check = len(res['hits']['hits'])
```
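For context, here is a minimal, self-contained sketch of such a terminating scroll loop, run against a stub connection instead of a real cluster (`FakeES`, the page data, and the counters are all illustrative stand-ins for the es2csv internals):

```python
class FakeES:
    """Stub standing in for the Elasticsearch connection: serves queued
    pages, then empty pages forever (like an exhausted or broken scroll)."""
    def __init__(self, pages):
        self.pages = list(pages)

    def scroll(self, scroll=None, scroll_id=None):
        hits = self.pages.pop(0) if self.pages else []
        return {'_scroll_id': 'stub-id', 'hits': {'total': 5, 'hits': hits}}

es_conn = FakeES([[{'_id': 1}, {'_id': 2}], [{'_id': 3}, {'_id': 4}], [{'_id': 5}]])
num_results = 5
total_lines = 0
# With search_type=scan, the first response carries only the total, no hits.
res = {'_scroll_id': 'stub-id', 'hits': {'total': 5, 'hits': []}}

hits_check = res['hits']['total']  # seed with the total so the loop can start
while total_lines != num_results and hits_check > 0:
    res = es_conn.scroll(scroll='30m', scroll_id=res['_scroll_id'])
    hits_check = len(res['hits']['hits'])  # 0 ends the loop even if counts never match
    total_lines += hits_check

print(total_lines)  # 5
```

If the scroll context breaks mid-export, as in this issue, the empty page drives `hits_check` to 0 and the loop exits instead of spinning forever.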
The script has been running for a week now without hanging (no more infinite while loop).
Thanks once again.
@taraslayshchuk Any chance you'll commit this check to master? I'm running into the same issue.
This is already done, but not yet in a pip release. In any case, it looks like we still haven't found the root cause of the problem (please look at issue #10).
Hello
I am using this library to fetch all the records from ES, and it works like a charm. Best tool available for ES-to-CSV export.
However, sometimes the script hangs during the search_query() phase: it stops writing to the temporary file but keeps running.
Would this be a script issue or an issue on the ES side?
Any help or pointers in the right direction are appreciated.
Thanks Amit