Open ml-evs opened 1 year ago
You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.
The OPTIMADE standard offers page_above
and page_below
to do value-based pagination.
We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations are implemented.
To implement this clean and easy, the OPT function handle_query_params
should extract the page_above
and page_below
query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.
Maybe another remark. The 10000 entries limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why its capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.
You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.
The OPTIMADE standard offers
page_above
andpage_below
to do value-based pagination.We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations are implemented.
To implement this clean and easy, the OPT function
handle_query_params
should extract thepage_above
andpage_below
query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.
If that would work then lets do it! We only have number-based pagination via page_number
and offset-based via page_offset
at the moment.
Maybe another remark. The 10000 entries limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why its capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.
As far as I understand it my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after
approach suggested by ES, just without access to the actual indexes?)
If that would work then lets do it! We only have number-based pagination via page_number and offset-based via page_offset at the moment.
Well, we need to update our OPT version anyways, when you have this result-pydandic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below
into handle_query_params
for the next release, I can do the ES part.
As far as I understand it my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)
Yes, but under the hood it is translated into offset-based pagination in ES (this is the only thing implemented). In this discussion offset and page based are synonyms.
Well, we need to update our OPT version anyways, when you have this result-pydandic-validation-optional-switch in it. Does this exist now? If you could squeeze
page_below
intohandle_query_params
for the next release, I can do the ES part.
I'll see what I can do -- I can at least patch in the query params for now and release as you suggest. I never ended up making a PR for disabling validation as I couldn't generate a test case where there was any signficant perf. improvement, but in the case of allowing "slightly" wrong data through then it is useful ;)
Hi devs, as I've discussed in the past with @markus1978, we'd like to be able to do deep pagination of the NOMAD OPTIMADE API. Currently pagination is limited to the first 10000 results of a query, until it hits an ES limitation. This can be circumvented with some work on the ES backend (https://www.elastic.co/guide/en/elasticsearch//reference/current/paginate-search-results.html#search-after) but I think this might a common problem so instead would like to resolve it in my client code at optimade-python-tools.
The "fix" would be to paginate into a query until failure, then use the last page of results to generate a new query that limits results to the remaining pages. This requires a field to sort over, and ideally I would sort over
id
/immutable_id
or_nmd_upload_time
. Unfortunately these sorts are not available in the current NOMAD OPTIMADE API, though I think the default sort is_nmd_upload_time
.Of course, this workaround can lead to inconsistent data between paginated queries, but I think basically every other OPTMADE API is already affected by this...
So, I have a few questions:
1) Would it be possible to enable this kind of sort in the OPTIMADE API? 2) Would you support the addition of a specific error type for this in the specification, so clients can detect it and apply a workaround? 3) Do you want to help try to implement the suggested ES
search_after
fix within optimade-python-tools?