pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License
218 stars 162 forks source link

deterministic score tiebreaker #1608

Open missinglink opened 2 years ago

missinglink commented 2 years ago

partner PR for https://github.com/pelias/query/pull/130

orangejulius commented 2 years ago

Huh, fun. Looks like this actually fails because loading the _id field from disk is expensive:

[exception] java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [3399601624/3.1gb], which is larger than the limit of [3373164134/3.1gb]]
missinglink commented 2 years ago

Wow ok, problem with using _doc instead is it changes between builds.

missinglink commented 2 years ago

I'd be interested to know which query caused that, I assumed sort was after hits are pruned

missinglink commented 2 years ago

Or at least the second sort criteria was only a secondary sorting based on the top $size hits of the first criteria

missinglink commented 2 years ago

CircuitBreakingException

https://gitlab.com/crossref/event_data_query/-/issues/33

missinglink commented 2 years ago

reference for _id https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-id-field.html it mentions this:

In case sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled.

It seems that since docvalues is not enabled for _id that the entire document needs to be loaded from disk to use this one field for sorting, which is clearly prohibitively expensive.

So there's a couple options if we'd like to pursue this:

  1. use [_score, _doc], this provides a secondary scoring index based on the document insertion order. This would be deterministic within a single build but not deterministic across multiple builds. My assumption is that this is 'cheap' since this _doc integer is probably already available within the results and wouldn't need to be fetched, but I don't think it really solves the problem, or at least half-solves a problem we might not have 🤔

  2. use some fields with docvalues, such as something like [_score, layer, source, source_id] (the first two being defined in the schema as keyword_with_doc_values and the latter being keyword which would need to change). This would achieve the desired result but would potentially cause a bit of disk spin and therefore a perf hit.

[edit] reversing the fields to have the less common terms first would be preferable!

missinglink commented 2 years ago

My personal preference would be either 3. abandon this and move on /or 2. use docvalues fields and hope that the fields are small enough that they are always in RAM and therefore don't suffer a perf hit.

orangejulius commented 2 years ago

Yeah, using [_score, layer, source, source_id] makes sense: the source_id field in our elasticsearch documents has been without a purpose for quite some time, since it's duplicated in the _id field and easily calculated. But if a separate field from _id is the only way to use docvalues, then that's perfect.

missinglink commented 2 years ago

Seems we can probably use only source_id since it's almost unique we won't require any further sorting conditions. The dilemma here is that using such a field will also use the most RAM 🤔

orangejulius commented 2 years ago

Maybe we want to try a different approach then? What if we re-sorted results in the API after retrieving them from Elasticsearch? Sort first by score, then by gid. This will always be fast because it's sorting at most ~80 records at once (with size=40 and a factor for extra records to remove dupes), and won't require any changes to our ES queries or their performance characteristics.

If sorting by source_id in Elasticsearch turned out to be free, then that would also be fine, but I suspect it's not free.

missinglink commented 2 years ago

Yeah I had a similar thought, the concern there is if the top n 'hits' returned from ES to the API were non-deterministic then it wouldn't really solve the core issue

orangejulius commented 2 years ago

Ahh, good point. It would probably handle most cases but definitely not every case.

So lets try a build with source_id docvalues enabled, and see how bad it is?

If it's really bad maybe we do it API-side for now and work on introducing more scoring differentiators into the queries in the future?