nomad-coe / nomad

NOMAD lets you manage and share your materials science data in a way that makes it truly useful to you, your group, and the community.
https://nomad-lab.eu
Apache License 2.0
76 stars 16 forks source link

[OPTIMADE] Deep pagination of NOMAD OPTIMADE API by clients #45

Open ml-evs opened 1 year ago

ml-evs commented 1 year ago

Hi devs, as I've discussed in the past with @markus1978, we'd like to be able to do deep pagination of the NOMAD OPTIMADE API. Currently pagination is limited to the first 10000 results of a query, until it hits an ES limitation. This can be circumvented with some work on the ES backend (https://www.elastic.co/guide/en/elasticsearch//reference/current/paginate-search-results.html#search-after) but I think this might a common problem so instead would like to resolve it in my client code at optimade-python-tools.

The "fix" would be to paginate into a query until failure, then use the last page of results to generate a new query that limits results to the remaining pages. This requires a field to sort over, and ideally I would sort over id/immutable_id or _nmd_upload_time. Unfortunately these sorts are not available in the current NOMAD OPTIMADE API, though I think the default sort is _nmd_upload_time.

Of course, this workaround can lead to inconsistent data between paginated queries, but I think basically every other OPTMADE API is already affected by this...

So, I have a few questions:

1) Would it be possible to enable this kind of sort in the OPTIMADE API? 2) Would you support the addition of a specific error type for this in the specification, so clients can detect it and apply a workaround? 3) Do you want to help try to implement the suggested ES search_after fix within optimade-python-tools?

markus1978 commented 1 year ago

You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.

The OPTIMADE standard offers page_above and page_below to do value-based pagination.

We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations are implemented.

To implement this clean and easy, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.

markus1978 commented 1 year ago

Maybe another remark. The 10000 entries limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why its capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.

ml-evs commented 1 year ago

You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.

The OPTIMADE standard offers page_above and page_below to do value-based pagination.

We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations are implemented.

To implement this clean and easy, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.

If that would work then lets do it! We only have number-based pagination via page_number and offset-based via page_offset at the moment.

Maybe another remark. The 10000 entries limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why its capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.

As far as I understand it my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

markus1978 commented 1 year ago

If that would work then lets do it! We only have number-based pagination via page_number and offset-based via page_offset at the moment.

Well, we need to update our OPT version anyways, when you have this result-pydandic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

As far as I understand it my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

Yes, but under the hood it is translated into offset-based pagination in ES (this is the only thing implemented). In this discussion offset and page based are synonyms.

ml-evs commented 1 year ago

Well, we need to update our OPT version anyways, when you have this result-pydandic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

I'll see what I can do -- I can at least patch in the query params for now and release as you suggest. I never ended up making a PR for disabling validation as I couldn't generate a test case where there was any signficant perf. improvement, but in the case of allowing "slightly" wrong data through then it is useful ;)