olivere / elastic

Deprecated: Use the official Elasticsearch client for Go at https://github.com/elastic/go-elasticsearch
https://olivere.github.io/elastic/
MIT License
7.4k stars 1.15k forks source link

Scroll service does not return all errors #888

Open praseodym opened 6 years ago

praseodym commented 6 years ago

Which version of Elastic are you using?

elastic.v6 (for Elasticsearch 6.x)

Any steps to reproduce the behavior?

While implementing a parallel sliced scroll using the scroll service, I made the mistake of re-using the same scroll service struct across multiple goroutines. This caused my program to re-use a scroll id, leading to a (partial) failure (search_context_missing_exception) response with zero results from Elasticsearch.

Please describe the actual behavior

The error that Elasticsearch returned was swallowed by the ScrollService: instead, io.EOF was returned to indicate that there were no more results. This made the issue very hard to diagnose as I was looking at other factors before suspecting a programming error.

Please describe the expected behavior

The ScrollService.Do function returns an error when a (partial) failure is returned by Elasticsearch.

olivere commented 6 years ago

This line and this line are the ones returning io.EOF. Do you have any reproducible test case where they get there and swallow an error? I don't see it atm. If there were an error from ES, it should be returned from PerformRequest calls, here and here.

praseodym commented 6 years ago

I found that the easiest way to reproduce this was to use my buggy code that re-uses the scroll service across multiple goroutines. I've created a Gist that also includes its output with tracing enabled: https://gist.github.com/praseodym/d3aec96cf01488356af9dfaadd476e9c

Notice that the program completes without error, but saw only 24890 out of the 111394 total documents (this number varies across runs). However, several error responses from Elasticsearch can be found in the trace log, indicating that there is actually a bug in our program:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

{"_scroll_id":"--snip--","took":1,"timed_out":false,"terminated_early":true,"_shards":{"total":5,"successful":4,"skipped":0,"failed":1,"failures":[{"shard":4,"index":"shakespeare","node":"--nip--","reason":{"type":"null_pointer_exception","reason":null}}]},"hits":{"total":22430,"max_score":null,"hits":[]}}

It looks like this error is swallowed because checkResponse doesn't consider a 200 OK status code to ever contain an error: https://github.com/olivere/elastic/blob/e6cae211ee802eab70248a68fe2cb03b616744c9/errors.go#L28-L32 I'm not sure if it is expected behaviour for Elasticsearch to return errors with a 200 OK status code; that could be a bug.

On a side note, the RW mutex mu in scroll.go does not seem to be very useful in its current form. While it does prevent the scroll id from being set in a non-atomic manner, when doing parallel invocations of .next the service scroll id is still clobbered by requests initiated from other goroutines. Synchronising across the entire request-response cycle would solve the problem, but such a race should never occur in a well-written program anyway. https://github.com/olivere/elastic/blob/e6cae211ee802eab70248a68fe2cb03b616744c9/scroll.go#L431-L451

olivere commented 6 years ago

I agree that we should handle this with more care. I think the slicing feature added a side-effect that I wasn't aware of. Before that, I think, the ScrollService should've worked fine and the scrollId was guarded properly. I'll have to look at this when not in a hurry, so give me a few more days.

That being said: The only "thing" that should be tread-safe by default is the *Client. I'm not saying this as an excuse, but I think it's a good rule of thumb. The Scroll API should be better protected against misuse nevertheless.

praseodym commented 6 years ago

No rush! I think adding error handling should be enough to protect against misuse of the Scroll API. Some notion of thread-safety in the documentation could also be useful. I’m not keen on sprinkling mutexes around the codebase, so if they’re not strictly needed during normal operation I would propose to remove them.