metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

Add paging support for URL/HTML #464

Closed TobiasNx closed 1 year ago

TobiasNx commented 1 year ago

We can only open single URLs but cannot page through multiple URLs. This is connected to #460, since it is relevant for GET and POST requests to APIs.

dr0i commented 1 year ago

To be discussed, @fsteeg and @blackwinter: we could enhance HttpOpener, but I think a module of its own would be better, like:

$urlToRetrieveAllTotalItemsAsJson
| open-http
| decode-json(recordpath="pagination.totalItems")
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

Identified with @TobiasNx at least 3 different types of pagination:

  1. with "total items" and a "size" (see example above)
  2. using $urlWithPageParameter , where the value of "page" is incremented as long as no 404 appears
  3. using a (next/resumption) token

(1) and (2) are already needed by OERSI. The httpPager would be capable of handling all 3 paging types. Using (1) would be independent of the serialisation of the API's response (JSON/XML/text). (3) could maybe also be used as a naive OAI-PMH harvester.
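To make the three types easier to compare, here is a minimal Python sketch of their stop conditions. It is not Metafacture code: the function names are hypothetical, the HTTP call is replaced by a callable passed in by the caller, `None` stands in for a 404 response, and starting the page count at 1 is an assumption.

```python
from typing import Callable, Iterator, Optional, Tuple


def offsets_by_total(total_items: int, size: int) -> Iterator[int]:
    """Type (1): the total is known up front, so all offsets can be computed."""
    return iter(range(0, total_items, size))


def pages_until_missing(fetch_page: Callable[[int], Optional[str]]) -> Iterator[str]:
    """Type (2): increment the page number until the server reports 'not found'.

    fetch_page returns None to stand in for an HTTP 404 (assumption).
    """
    page = 1  # assumption: the API counts pages from 1
    while True:
        body = fetch_page(page)
        if body is None:
            return
        yield body
        page += 1


def pages_by_token(
    fetch: Callable[[Optional[str]], Tuple[str, Optional[str]]]
) -> Iterator[str]:
    """Type (3): each response carries the token needed for the next request.

    fetch takes the current token (None for the first call) and returns
    the response body plus the next token, or None when there is no next page.
    """
    token: Optional[str] = None
    while True:
        body, token = fetch(token)
        yield body
        if token is None:
            return
```

Only type (1) can plan all requests before sending any; types (2) and (3) have to inspect each response before they know whether to continue.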

fsteeg commented 1 year ago

I don't quite understand how your example(s) would work. I think I'd need a concrete API / URL / input data (for the different kinds of pagination).

Perhaps starting with the OERSI example that I used in my first / WIP approach to pagination support: https://gitlab.com/oersi/oersi-etl/-/commit/60ba8d70b1dd728006cd07485b632ecef961e98d

Did you see the Swissbib approach: https://github.com/linked-swissbib/swissbib-metafacture-commands#open-multi-http

And I had this general thought of using URL globbing for paging somehow: https://everything.curl.dev/cmdline/globbing
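As a rough illustration of the globbing idea, the sketch below expands a curl-style numeric range such as `[1-3]` or `[0-200:100]` into concrete URLs. The function name and the example.org URL are hypothetical, and only the first numeric range in the pattern is handled (no alphabetic ranges, brace lists, or zero padding).

```python
import re
from typing import List

# Matches a curl-style numeric range: [start-end] or [start-end:step]
_RANGE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_numeric_glob(url_pattern: str) -> List[str]:
    """Expand the first numeric range in url_pattern into a list of URLs."""
    match = _RANGE.search(url_pattern)
    if match is None:
        # Nothing to expand: the pattern is already a plain URL.
        return [url_pattern]
    start, end = int(match.group(1)), int(match.group(2))
    step = int(match.group(3) or 1)
    prefix = url_pattern[: match.start()]
    suffix = url_pattern[match.end():]
    # The end of the range is inclusive, as in curl.
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


urls = expand_numeric_glob("https://example.org/api?page=[1-3]")
```

Like the swissbib approach, this requires knowing the upper bound before the first request is sent.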

dr0i commented 1 year ago

re swissbib: it's just a very sad way to do it, because you have to know up front how much data there is and set these values via parameters. My approach would parse the actual totalItems and pass it to the pager.

Your approach is tied to JSON as the result format. Also, it seems to make use of sitemaps to get all resources - this would be another type "4" (as I understand it, we want to switch from sitemaps to using an API)

re globbing: this is type (2).

Concrete example for using an API (type (1) ) (enabled by https://github.com/metafacture/metafacture-core/pull/463):

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-", method="post")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

TobiasNx commented 1 year ago

One thing I think is difficult is that the input within MF for page-http would vary for every paging type. This would make this module complex in itself.

blackwinter commented 1 year ago

I don't think I have much to add at this point. Except maybe that type 3 could also be useful for Elasticsearch pagination (search_after and/or scroll).

fsteeg commented 1 year ago

Your approach is tied to JSON as the result format. Also, it seems to make use of sitemaps to get all resources - this would be another type "4" (as I understand it, we want to switch from sitemaps to using an API)

You mean the WIP approach I mentioned above? That actually removes using the sitemap to get the list of resources.

fsteeg commented 1 year ago

I'm not sure if it makes sense to approach this in a generic way. It all depends on what API we actually want to talk to. Maybe we should start with implementing paging for specific use cases instead, and then try to generalize that.

fsteeg commented 1 year ago

Discussed in our planning meeting: I will try to implement the approach described by @dr0i in https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1237836374 for edusharing APIs (4 workflows) in OERSI (e.g. ZOERR, see https://gitlab.com/oersi/oersi-etl/-/issues/64).

fsteeg commented 1 year ago

I will try to implement the approach described by @dr0i in https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1237836374 for edusharing APIs (4 workflows) in OERSI

To recall, this was the sketched approach from above:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")
| decode-json(recordpath="pagination.totalItems") // => 1348
| page-http(url="$baseUrlWithPaginationParmeter", size="100")
| print
;

One problem is that conceptually, this part has to be called repeatedly:

'{"criterias": [], "facettes": []}'
| open-http(url="https://www.zoerr.de/edu-sharing/.../ngsearch?skipCount=0")

With different skipCount= values, but always passing the body to POST. So further down, when we want to do something like page-http, we no longer have the {"criterias": [], "facettes": []} body to POST. We also need to process the response of open-http to pass each record on the page, which would work with decode-json and recordPath, but then we'd kind of have to go back to call again, with incremented skipCount=.

To solve this in a generic way as envisioned here, I think we would need some kind of loop construct in the Flux. But that whole setup would become very complex. I think a specific module (or extension of HttpOpener) would be a better way to go. I was going to start with a very specific EdusharingReader, but thankfully @TobiasNx and @acka47 stopped me there, asking for something a little more generic, which became a JsonApiReader and seems to be a good balance of specific and generic, see https://gitlab.com/oersi/oersi-etl/-/merge_requests/227/diffs.

For proceeding here, I suggest we close this issue and keep using that new module for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself.

dr0i commented 1 year ago

Just had a short glimpse at this - the idea is to get all items via invoking open-http and decode-json (now around 1360) and then do it like swissbib: repeatedly getting the document in page-http(url="$baseUrlWithPaginationParmeter", size="100") up to this 1360. The skipCount and the do-loop is done in page-http and stopped when totalItems is reached. Does this not work?

fsteeg commented 1 year ago

the idea is to get all items via invoking open-http and decode-json (now around 1360)

I'm not sure I understand. We can't get them all with a single call, that's why we need the paging. (Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)
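The stop-on-empty behaviour can be sketched in a few lines of Python. This is an illustration under assumptions, not the JsonApiReader code: `read_all_records` and the `post` callable are hypothetical, and `post` is assumed to send the same request body with the given skipCount and return the already-parsed list of records for that page.

```python
from typing import Callable, Iterator, List


def read_all_records(
    post: Callable[[int, str], List[dict]],
    request_body: str,
    page_size: int,
) -> Iterator[dict]:
    """Yield records page by page until the API returns an empty page.

    post(skip_count, request_body) is a stand-in for the actual HTTP POST;
    the same body is sent with every request, only skip_count changes.
    """
    skip_count = 0
    while True:
        records = post(skip_count, request_body)
        if not records:
            # Empty page: assume there is nothing more to fetch.
            return
        yield from records
        skip_count += page_size
```

This costs one extra request at the end (the one that comes back empty), but needs no totalItems lookup.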

The skipCount and the do-loop is done in page-http and stopped when totalItems is reached.

If we do the repeated calls in page-http, we need the JSON body from the first line, since that is part of the API request. (At that position, we're getting the totalItems number in your example instead.) We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways). I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

dr0i commented 1 year ago

(Or do you mean the number? In the current implementation in OERSI, we don't even need that, we stop when the API returns empty results.)

Yes, I mean the number. Relying on empty results may be a very good idea - but maybe not: there may be a non-empty result containing valid JSON saying "nothing here". Idk.

We could add the request body as an option to page-http, but that would be redundant and inconsistent (specifying the request body twice, in different ways).

You could (re)use a variable: `baseUrlWithPaginationParameter='https://www.zoerr.de/edu-sharing/rest/search/v1/queriesV2/-home-/-default-/ngsearch?maxItems=10&skipCount=0&propertyFilter=-all-' -d '{"criterias": [], "facettes": []}'`

... | page-http(url="$baseUrlWithPaginationParameter", size="100", setSizeValueForParameter="skipCount")

The page-http would use the Java class corresponding to open-http (if that's possible), flushing results to downstream modules until skipCount is greater than the input passed to | page-http.
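What such a page-http would compute can be sketched as follows. This is only an illustration in Python, not an existing Metafacture module: `with_query_value` and `paged_urls` are hypothetical names, the example.org URL is a placeholder, and the loop stops as soon as the offset reaches the total passed in.

```python
from typing import Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_value(url: str, name: str, value: int) -> str:
    """Return url with query parameter `name` set to `value`."""
    parts = urlsplit(url)
    # Drop any existing occurrence of the parameter, keep all others.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != name
    ]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def paged_urls(base_url: str, total_items: int, size: int, parameter: str) -> Iterator[str]:
    """Yield one URL per page, advancing `parameter` in steps of `size`."""
    for offset in range(0, total_items, size):
        yield with_query_value(base_url, parameter, offset)


base = "https://example.org/search?maxItems=10&skipCount=0&propertyFilter=-all-"
urls = list(paged_urls(base, total_items=250, size=100, parameter="skipCount"))
```

Note that the rewritten parameter moves to the end of the query string, which should not matter to an API but does change the URL text. The request body would still have to be sent with each of these URLs.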

I also don't see what we gain by that separation in the Flux of the first API call and the following API calls.

The idea is to be more generic, independent of a JSON API or XML or whatever (even HTTP headers, given the proper module). BUT maybe my thinking has a flaw and this generic approach is too complex in itself (commands piping into commands piping into ... + using a variable + setting proper parameters), so this is not a viable approach (even if wrappers (new "commands") could be programmed as an abbreviation for the different APIs).

fsteeg commented 1 year ago

Right, thanks for explaining, now I see how that could be done in a reasonable way. We should keep that in mind for when we have other pagination use cases; this could help avoid the need for new modules and duplication.

fsteeg commented 1 year ago

Quoting myself from https://github.com/metafacture/metafacture-core/issues/464#issuecomment-1295129003:

For proceeding here, I suggest we close this issue and keep using that new module (JsonApiReader) for some more workflows in OERSI, and consider if and how we want to move it to metafacture-core when a use case in a second project comes up, be it as a dedicated module as currently, or by adding the paging functionality to HttpOpener itself. Edit: Or by implementing generic paging as discussed above.

@TobiasNx, since you opened this issue, could you close if you agree?

TobiasNx commented 1 year ago

I am okay with that!