ruflin / Elastica

Elastica is a PHP client for elasticsearch
http://elastica.io/
MIT License

Documents sometimes not correctly updated resulting in data loss. #562

Open tom-pryor opened 10 years ago

tom-pryor commented 10 years ago

At moderate indexing volume (about 5 individual documents per second) we're noticing some data loss. A new document is indexed, then shortly afterwards (a few seconds later, longer than refresh_interval, i.e. once the document has been indexed) we attempt to update a field in that document, again using Elastica.

However, an "Undefined index _version" error at:

https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Type.php#L248

occasionally occurs when updating the document. When it does, the whole document is replaced with only the updated field(s), causing data loss.

The error logging tool we are using records the context, and the value of $result at that point is very strange. It's an array of 4 elements:

| Key       | Value             |
|-----------|-------------------|
| _shards   | Array of length 3 |
| hits      | Array of length 3 |
| timed_out | false             |
| took      | 414               |

This seems to indicate that a search query was performed rather than a fetch of the document by id.
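The difference between the two response shapes can be checked before the result is used. The following is a minimal sketch in plain PHP, not Elastica code: `looksLikeGetResponse` is a hypothetical helper, and the array values are illustrative placeholders modelled on the keys listed above and on the usual shape of a get-by-id response.

```php
<?php
/**
 * Hypothetical helper (not part of Elastica): returns true only when the
 * decoded response looks like a get-by-id result rather than a search result.
 */
function looksLikeGetResponse(array $result)
{
    // A get-by-id response carries document metadata such as _id and _version.
    $hasDocumentMeta = isset($result['_id'], $result['_version']);

    // A search response carries hits, took and timed_out instead.
    $hasSearchKeys = isset($result['hits'])
        || isset($result['took'])
        || isset($result['timed_out']);

    return $hasDocumentMeta && !$hasSearchKeys;
}

// Shape seen in the error context: a search-style response (values illustrative).
$stale = array(
    '_shards'   => array('total' => 5, 'successful' => 5, 'failed' => 0),
    'hits'      => array('total' => 0, 'max_score' => null, 'hits' => array()),
    'timed_out' => false,
    'took'      => 414,
);

// Shape expected when fetching a document by id (values illustrative).
$expected = array(
    '_index'   => 'example_index',
    '_type'    => 'example_type',
    '_id'      => '1',
    '_version' => 2,
    'found'    => true,
    '_source'  => array('field' => 'value'),
);

var_dump(looksLikeGetResponse($stale));    // bool(false)
var_dump(looksLikeGetResponse($expected)); // bool(true)
```

A guard like this would not explain why the wrong response arrives, but refusing to continue with the update when the check fails should at least avoid overwriting the document with partial data.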

Running Elasticsearch 1.0.1 and Elastica 1.0.0.

I'll try and see if I can get some more information.

tom-pryor commented 10 years ago

I've been playing around, and the issue seems related to the persistent curl connection. I've logged the responses, and with a persistent curl connection Elastica occasionally gets back either a stale response (i.e. the result of a previous request) or a blank one.

ruflin commented 10 years ago

Did you try turning off persistent connections? https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Client.php#L38

It will make requests slower, but it might solve the problem.
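As a rough sketch, disabling the persistent connection could look like the following. It assumes Elastica is installed through Composer (the `vendor/autoload.php` path is an assumption), that the client accepts a `persistent` key in its configuration array, which is the option the link above appears to point at, and that Elasticsearch runs on localhost:9200.

```php
<?php
// Assumption: Elastica is autoloaded via Composer.
require 'vendor/autoload.php';

// Assumption: 'persistent' is the client config key that controls
// whether the curl connection is reused between requests.
$client = new \Elastica\Client(array(
    'host'       => 'localhost', // placeholder host
    'port'       => 9200,        // placeholder port
    'persistent' => false,       // open a fresh connection per request
));
```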

Another good option is to use bulk queries if you have a lot of requests.

tom-pryor commented 10 years ago

Yeah, turning off persistent fixes the problem. I'm not sure why the issue occurs with persistent enabled, though; it seems like strange behaviour.

I'd use bulk queries, but the problem is that we are indexing data received over an API (i.e. we have no control over when data comes in) and it needs to be available to search pretty much instantly.

ruflin commented 10 years ago

Which PHP and curl versions are you using?

ruflin commented 10 years ago

@Tomdarkness Can you check if this change resolves your problem? https://github.com/ruflin/Elastica/pull/567/files