Open missinglink opened 2 years ago
Huh, fun. Looks like this actually fails because loading the _id
field from disk is expensive:
[exception] java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [3399601624/3.1gb], which is larger than the limit of [3373164134/3.1gb]]
Wow ok, problem with using _doc
instead is it changes between builds.
I'd be interested to know which query caused that, I assumed sort was after hits are pruned
Or at least the second sort criteria was only a secondary sorting based on the top $size hits of the first criteria
CircuitBreakingException
reference for _id
https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-id-field.html
it mentions this:
In case sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled.
It seems that since docvalues
is not enabled for _id
that the entire document needs to be loaded from disk to use this one field for sorting, which is clearly prohibitively expensive.
So there's a couple options if we'd like to pursue this:
use [_score, _doc]
, this provides a secondary scoring index based on the document insertion order. This would be deterministic within a single build but not deterministic across multiple builds. My assumption is that this is 'cheap' since this _doc
integer is probably already available within the results and wouldn't need to be fetched, but I don't think it really solves the problem, or at least half-solves a problem we might not have 🤔
use some fields with docvalues
, such as something like [_score, layer, source, source_id]
(the first two being defined in the schema as keyword_with_doc_values
and the latter being keyword
which would need to change). This would achieve the desired result but would potentially cause a bit of disk spin and therefore a perf hit.
[edit] reversing the fields to have the less common terms first would be preferable!
My personal preference would be either 3. abandon this and move on /or 2. use docvalues
fields and hope that the fields are small enough that they are always in RAM and therefore don't suffer a perf hit.
Yeah, using [_score, layer, source, source_id]
makes sense: the source_id
field in our elasticsearch documents has been without a purpose for quite some time, since it's duplicated in the _id
field and easily calculated. But if a separate field from _id
is the only way to use docvalues
, then that's perfect.
Seems we can probably use only source_id
since it's almost unique we won't require any further sorting conditions.
The dilemma here is that using such a field will also use the most RAM 🤔
Maybe we want to try a different approach then? What if we re-sorted results in the API after retrieving them from Elasticsearch? Sort first by score, then by gid
. This will always be fast because it's sorting at most ~80 records at once (with size=40
and a factor for extra records to remove dupes), and won't require any changes to our ES queries or their performance characteristics.
If sorting by source_id
in Elasticsearch turned out to be free, then that would also be fine, but I suspect it's not free.
Yeah I had a similar thought, the concern there is if the top n 'hits' returned from ES to the API were non-deterministic then it wouldn't really solve the core issue
Ahh, good point. It would probably handle most cases but definitely not every case.
So lets try a build with source_id
docvalues enabled, and see how bad it is?
If it's really bad maybe we do it API-side for now and work on introducing more scoring differentiators into the queries in the future?
partner PR for https://github.com/pelias/query/pull/130