Open tomdev opened 8 months ago
Hey @tomdev
Thanks for the PR. I tried to reproduce it by setting low scroll time and adding extra wait time between scroll calls. Whenever there is an mentioned error (search_context_missing_exception
) ES transport gem throws the exception for me:
[404] {"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [24881178]"}},"status":404} (Elasticsearch::Transport::Transport::Errors::NotFound)
Could you please share your chewy version, elasticsearch server version and elasticsearch gem version, that could be helpful?
The issue could be also solved if you could set the scroll pointer expiration (which defaults to 1 minute) for pluck
calls (which is called in chewy:sync
).
Hey Barthez, thanks for reproducing the issue.
Interesting to see the response is actually a 404
response. In our setup we recently migrated from Elasticsearch 7.10 to OpenSearch 1.3; this appears to return a 200
, even though the "search_context_missing_exception"
is being hit.
We're unable to continue using Elasticsearch as we're using the AWS OpenSearch service.
We're running:
chewy (7.3.4)
elasticsearch (7.13.3)
We haven't identified any other issues (thus far) using chewy against OpenSearch.
An attempt to increase the scroll pointer expiration did not succeed; we've set it to 10m
, but due to the search_context_missing_exception
(that in our case was not caused by an expiring scroll window, but seems to be related to an internal OpenSearch error, yielding the same error).
Do you think OpenSearch returning a 200
with failures should be handled in chewy, or would that be in a dependency like ES transport? You are probably more familiar than I am on where this should be handled.
Thanks @tomdev
Chewy does not aim to support OpenSearch. This sounds like an altered behavior of OpenSearch vs Elasticsearch.
I would suggest adding a custom exception, something like MissingHitsInScrollError
, and raising it when we receive no hits when we expect some. I wouldn't parse the response as this is too platform-dependent. Can you do that?
I implemented the MissingHitsInScrollError
when no hits are returned when they are expected. This required me to change the looping behaviour when scrolling; instead of infinitely scrolling we now precalcalculate how often we should perform batched requests.
This is slightly altering the behaviour of how this data is retrieved and even though the test suite succeeds I wanted to double check if anyone knows why the previous approach was chosen (loop until fetched >= total hits). Could I be missing an edge case here that was covered by the previous implementation? The specs don't indicate that.
We've been running this PR in production for ~2 weeks and have seen the failure (where we'd previously got stuck in an infinite loop) now successfully raising the MissingHitsInScrollError
.
Thank you @tomdev. Sorry, I lost track of this PR. Cold you please rebase and fix the conflict? I will try to merge & release it as soon as possible.
We are running into a bug in production when performing
chewy:sync
.Our search backend is intermittently returning a
200
response without hits and containing a backend failure, see example response:scroll_batches currently is not taking these failures into account. Because there are no hits returned, the logic of
fetched >= total
will never be reached, causing the loop to never break.Because of this we've experienced
chewy:sync
running for days instead of an hour. (Yes, we now have proper monitoring in place...)This PR will raise a
Chewy::Error
when the search backend is returning failures.Before submitting the PR make sure the following are checked:
[Fix #issue-number]
(if the related issue exists).master
(if not - rebase it).