the-library-code / dspace-rest-python

DSpace REST API Client Library
BSD 3-Clause "New" or "Revised" License

Add iterator-based methods for automatic pagination #25

Open dpk opened 3 days ago

dpk commented 3 days ago

For methods such as get_bundles, get_bitstreams, and get_communities, which involve pagination, the API consumer currently has to page through the results manually.

It would be nice to have iterator-based variants of these that automatically request the next page of data from the REST API as the consumer iterates. From what I can tell of how the underlying REST API works, the pattern to implement this would be something like:

def iterate_all_pages():
    # get_page_from_api and rest_json_to_python_object are placeholders
    page_number = 0
    current_page = []
    # it may be easier to read this `while not` as ‘until’
    while not (page_number > 0 and len(current_page) == 0):
        current_page = get_page_from_api(page_number)
        for item in current_page:
            yield rest_json_to_python_object(item)
        page_number += 1

Unfortunately it seems to me that DSpace pagination is always in array style (where you request an absolute page number) and not in linked list style (where every request gives you a unique ID for the first item on the next page, and you request the page that starts with that unique ID). This means there is a race condition with getting paginated data if some more data is added to a collection between getting page n and page n+1. If there is a way to fix this at the DSpace level, an iterator-based protocol such as this should ideally use that instead.
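
To make the race condition concrete, here is a minimal sketch using an in-memory list in place of the REST API. The `get_page` helper, the sample items, and the newest-first ordering are all assumptions made for illustration; none of them come from DSpace or this library.

```python
from typing import List


def get_page(data: List[str], page: int, size: int) -> List[str]:
    """Return one page by absolute offset, as array-style pagination does."""
    start = page * size
    return data[start:start + size]


# Hypothetical collection, assumed to be ordered newest first.
items = ["c", "b", "a"]

first = get_page(items, 0, 2)    # fetch page 0
items.insert(0, "d")             # a new item arrives between the two requests
second = get_page(items, 1, 2)   # fetch page 1 from the shifted list

# "b" has moved from page 0 to page 1, so it is returned twice,
# and the new item "d" is never seen.
seen = first + second
```

With this ordering an insert produces a duplicate. A deletion between requests would shift items the other way and cause one to be skipped instead.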

kshepherd commented 3 days ago

DSpace pagination actually does give next links in the _links section, but only if there is more than one page in the results. You can see this by forcing the page size to 1 for small lists, e.g.

{
....
    .... embedded search results ...
      },
      "page" : {
        "number" : 0,
        "size" : 1,
        "totalPages" : 8,
        "totalElements" : 8
      },
      "_links" : {
        "next" : {
          "href" : "http://localhost:8080/server/api/discover/search/objects?page=1&size=1"
        },
        "last" : {
          "href" : "http://localhost:8080/server/api/discover/search/objects?page=7&size=1"
        },
        "self" : {
          "href" : "http://localhost:8080/server/api/discover/search/objects?size=1"
        }
      }
    },

The same goes for a prev link to the previous page. The links are still handled with absolute page and size numbers in the backend, so they won't address the race condition you mention, but they at least give consumers a consistent link to follow or detect: when it's not there, there are no more pages. (How HATEOAS / DSpace pagination works is a separate issue from how easy it is for our client lib to consume, but it could be worth raising at some stage.)
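
A rough sketch of what following the next link could look like on the client side. The names `iter_pages` and `fetch_json` are hypothetical, not part of this library. `fetch_json` stands for any callable that takes a URL and returns the parsed JSON object that carries `_links`. Where that object sits inside a real response depends on the endpoint, so the caller is assumed to have unwrapped it already.

```python
from typing import Any, Callable, Dict, Iterator


def iter_pages(
    fetch_json: Callable[[str], Dict[str, Any]],
    start_url: str,
) -> Iterator[Dict[str, Any]]:
    """Yield each page in turn, following _links.next.href until it is absent."""
    url = start_url
    while url is not None:
        page = fetch_json(url)
        yield page
        # No "next" entry in _links means this was the final page.
        url = page.get("_links", {}).get("next", {}).get("href")
```

Because this is a generator, each page is only requested when the consumer asks for it, so breaking out of a `for` loop early avoids fetching the remaining pages.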