Closed TobiasNx closed 1 year ago
To be discussed with @fsteeg and @blackwinter: we could enhance HttpOpener, but I think a module of its own would be better, like:
$urlToRetrieveAllTotalItemsAsJson
| open-http
| decode-json(recordpath="pagination.totalItems")
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;
Together with @TobiasNx, I identified at least 3 different types of pagination:
[...] 404 appears.
(1) and (2) are already needed by OERSI. The httpPager would be able to handle all 3 kinds of paging. Using (1) would be independent of the serialisation of the API response (JSON/XML/text). (3) could maybe also be used as a naive OAI-PMH harvester.
I don't quite understand how your example(s) would work. I think I'd need a concrete API / URL / input data (for the different kinds of pagination).
Perhaps starting with the OERSI example that I used in my first / WIP approach to pagination support: https://gitlab.com/oersi/oersi-etl/-/commit/60ba8d70b1dd728006cd07485b632ecef961e98d
Did you see the Swissbib approach: https://github.com/linked-swissbib/swissbib-metafacture-commands#open-multi-http
And I had this general thought of using URL globbing for paging somehow: https://everything.curl.dev/cmdline/globbing
re swissbib: it is just a very sad way to do it, because you have to know how much data there is to begin with and set these values via parameters. My approach would parse the actual totalItems and pass it to the pager.
Your approach is tied to JSON as the result format. It also seems to use sitemaps to get all resources, which would be another type "4" (as I understand it, we want to switch from sitemaps to using an API).
re globbing: this is type (2).
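To make the globbing idea a bit more concrete, here is a minimal sketch in Python of how a curl-style numeric range such as [0-300:100] could be expanded into one URL per page. The function name expand_glob and the example.org URL are made up for illustration; this is not an existing Metafacture feature, and it only handles the first numeric range in the URL.

```python
import re
from typing import Iterator

# Matches a curl-style numeric range such as [0-300:100] (start-end:step).
# The step is optional and defaults to 1.
RANGE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_glob(url: str) -> Iterator[str]:
    """Yield one URL per value of the first numeric range found in `url`.

    Hypothetical helper: the range is inclusive on both ends, as in curl.
    A URL without a range is passed through unchanged.
    """
    match = RANGE.search(url)
    if match is None:
        yield url
        return
    start, end = int(match.group(1)), int(match.group(2))
    step = int(match.group(3) or 1)
    for value in range(start, end + 1, step):
        yield url[:match.start()] + str(value) + url[match.end():]


urls = list(expand_glob(
    "https://example.org/search?maxItems=100&skipCount=[0-300:100]"))
for url in urls:
    print(url)
```

As with the swissbib approach, the upper bound has to be known before the first request, so this alone does not solve the totalItems problem.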
Concrete example for using an API (type (1)) (enabled by https://github.com/metafacture/metafacture-core/pull/463):
'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-", method="post")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;
One thing I think is difficult is that the input within MF for page-http would vary for every paging type. This would make this module complex in itself.
I don't think I have much to add at this point. Except maybe that type 3 could also be useful for Elasticsearch pagination (search_after and/or scroll).
Your approach is tied to JSON as the result format. It also seems to use sitemaps to get all resources, which would be another type "4" (as I understand it, we want to switch from sitemaps to using an API).
You mean the WIP approach I mentioned above? That actually removes using the sitemap to get the list of resources.
I'm not sure if it makes sense to approach this in a generic way. It all depends on what API we actually want to talk to. Maybe we should start with implementing paging for specific use cases instead, and then try to generalize that.
Discussed in our planning meeting: I will try to implement the approach described by @dr0i in https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1237836374 for edusharing APIs (4 workflows) in OERSI (e.g. ZOERR, see https://gitlab.com/oersi/oersi-etl/-/issues/64).
I will try to implement the approach described by @dr0i in https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1237836374 for edusharing APIs (4 workflows) in OERSI
To recall, this was the sketched approach from above:
'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParameter", size="100")
| print
;
One problem is that conceptually, this part has to be called repeatedly:
'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")
with different skipCount= values, but always passing the body to POST. So further down, when we want to do something like page-http, we no longer have the {"criterias": [], "facettes": []} body to POST. We also need to process the response of open-http to pass on each record on the page, which would work with decode-json and recordPath, but then we'd kind of have to go back and call the API again, with an incremented skipCount=.
To solve this in a generic way as envisioned here, I think we would need some kind of loop construct in the Flux. But that whole setup would become very complex. I think a specific module (or an extension of HttpOpener) would be a better way to go. I was going to start with a very specific EdusharingReader, but thankfully @TobiasNx and @acka47 stopped me there, asking for something a little more generic, which became a JsonApiReader and seems to be a good balance between specific and generic, see https://gitlab.com/oersi/oersi-etl/-/merge_requests/227/diffs.
For proceeding here, I suggest we close this issue and keep using that new module for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself.
Just had a quick look at this - the idea is to get all items via invoking open-http and decode-json (now around 1360) and then do it like swissbib: repeatedly getting the document in page-http(url="$baseUrlWithPaginationParameter", size="100") up to this 1360. The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.
Does this not work?
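As a rough illustration of that loop, here is a minimal sketch in Python. It assumes the total has already been read from the first response (pagination.totalItems) and that a fetch(skip_count, size) callable performs the actual request. Both page_by_total and fake_fetch are hypothetical names, and the API is replaced by an in-memory list so the sketch runs without network access.

```python
from typing import Callable, Iterator, List


def page_by_total(fetch: Callable[[int, int], List[dict]],
                  total_items: int, size: int) -> Iterator[dict]:
    """Call `fetch` with skipCount = 0, size, 2*size, ... below total_items.

    Hypothetical sketch of what a page-http module would do internally:
    the loop ends once skipCount reaches totalItems.
    """
    for skip_count in range(0, total_items, size):
        yield from fetch(skip_count, size)


# Stand-in for the API (assumption): 1348 records, as in the ZOERR example.
DATA = [{"id": i} for i in range(1348)]
calls: List[int] = []


def fake_fetch(skip_count: int, size: int) -> List[dict]:
    """Record the requested offset and return one page of fake records."""
    calls.append(skip_count)
    return DATA[skip_count:skip_count + size]


records = list(page_by_total(fake_fetch, total_items=len(DATA), size=100))
print(len(records), "records fetched in", len(calls), "requests")
```

Note that in the real case fetch would also need the POST body for every request, which is the point raised in the replies.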
the idea is to get all items via invoking open-http and decode-json (now around 1360)
I'm not sure I understand. We can't get them all with a single call, that's why we need the paging. (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)
The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.
If we do the repeated calls in page-http, we need the JSON body from the first line, since that is part of the API request. (At that position, we're getting the totalItems number in your example instead.) We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways). I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.
(Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)
Yes, I mean the number. Relying on empty results may be a very good idea, but maybe not: there may be a non-empty result containing valid JSON saying "nothing here". Idk.
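A minimal sketch in Python of the "stop on empty results" variant, with the emptiness check passed in as a function so that a response like "valid JSON saying nothing here" can still end the loop. The names page_until_empty and fake_fetch and the "nodes" and "message" fields are assumptions for illustration, not taken from the edu-sharing API or from JsonApiReader.

```python
from typing import Callable, Iterator, List


def page_until_empty(fetch: Callable[[int, int], dict], size: int,
                     is_empty: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield responses page by page until `is_empty` says there is no more.

    Hypothetical sketch: no total count is needed, but the caller has to
    say what "empty" means for the API at hand.
    """
    skip_count = 0
    while True:
        response = fetch(skip_count, size)
        if is_empty(response):
            return
        yield response
        skip_count += size


# Stand-in for the API (assumption): 250 records, and a non-empty JSON
# object with a message once we page past the end.
DATA = list(range(250))
calls: List[int] = []


def fake_fetch(skip_count: int, size: int) -> dict:
    """Record the requested offset and return one fake response object."""
    calls.append(skip_count)
    nodes = DATA[skip_count:skip_count + size]
    return {"nodes": nodes} if nodes else {"nodes": [], "message": "nothing here"}


pages = list(page_until_empty(fake_fetch, size=100,
                              is_empty=lambda r: not r["nodes"]))
print(len(pages), "pages,", len(calls), "requests")
```

This costs one extra request compared to knowing totalItems, and a real implementation would probably want an upper limit on the number of pages in case the API never returns an empty result.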
We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways).
You could (re)use a variable: `baseUrlWithPaginationParameter='https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-' -d '{"criterias": [], "facettes": []}'`
... | page-http(url="$baseUrlWithPaginationParameter", size="100", setSizeValueForParameter="skipCount")
The page-http would use the Java class corresponding to open-http (if that's possible), flushing results to downstream modules until skipCount is greater than the input passed to | page-http.
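To illustrate what setSizeValueForParameter="skipCount" could mean in practice, here is a minimal sketch in Python that rewrites one query parameter of the base URL for each page and reuses the same request body every time. The helper names with_query_param and paged_requests and the example.org URL are made up for illustration. The sketch only builds the (URL, body) pairs and sends nothing.

```python
from typing import Iterator, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, name: str, value: int) -> str:
    """Return `url` with query parameter `name` set to `value`.

    An existing parameter is replaced in place; a missing one is appended.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    updated = [(k, str(value)) if k == name else (k, v) for k, v in query]
    if all(k != name for k, _ in query):
        updated.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(updated)))


def paged_requests(url: str, body: str, param: str, size: int,
                   total_items: int) -> Iterator[Tuple[str, str]]:
    """Yield one (url, body) pair per page; the body is the same each time."""
    for offset in range(0, total_items, size):
        yield with_query_param(url, param, offset), body


# Hypothetical base URL and the POST body from the example in this thread.
BASE = "https://example.org/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-"
BODY = '{"criterias": [], "facettes": []}'

requests_to_send = list(paged_requests(BASE, BODY, "skipCount",
                                       size=100, total_items=250))
for url, body in requests_to_send:
    print("POST", url, body)
```

Keeping the URL and the body together in one place is what avoids specifying the request body twice.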
I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.
The idea is to be more generic, independent of a JSON API or XML or whatever (even HTTP headers, given the proper module). BUT maybe my thinking has a flaw and this generic approach is too complex in itself (commands piping into commands piping into ... + using a variable + setting proper parameters), so it is not a viable approach (even if wrappers (new "commands") could be programmed as abbreviations for the different APIs).
Right, thanks for explaining, now I see how that could be done in a reasonable way. We should keep that in mind for when we have other pagination use cases; it could help avoid the need for new modules and duplication.
Quoting myself from https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1295129003:
For proceeding here, I suggest we close this issue and keep using that new module (JsonApiReader) for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself. Edit: Or by implementing generic paging as discussed above.
@TobiasNx, since you opened this issue, could you close if you agree?
I am okay with that!
We can only open single URLs but cannot page through multiple URLs. This is connected to #460 since it is relevant for GET and POST with APIs.