sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Account for strange LOC paging #473

Closed aaron-collier closed 6 months ago

aaron-collier commented 6 months ago

This is to account for the strange LOC paging.

Paging data:

...
  "pagination": {
    "current": 1,
    "first": null,
    "from": 1,
    "last": "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json&sp=4",
    "next": "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json&sp=2",
    "of": 320,
    "page_list": [
      {
        "number": 1,
        "url": null
      },
      {
        "number": 2,
        "url": "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json&sp=2"
      },
      {
        "number": 3,
        "url": "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json&sp=3"
      },
      {
        "number": "...",
        "url": "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json&sp=4"
      }
    ],
    "perpage": 100,
    "perpage_options": [
      25,
      50,
      100,
      150
    ],
    "previous": null,
    "results": "1 - 100",
    "to": 100,
    "total": 4
  },
...

In the example above, page_list is both incomplete and unreliable. From my light analysis, it always contains exactly 4 entries: the current page (with a null URL), the next 2 pages, and the last page. Using total (which is the total number of pages, not of items) and an increment of 1, we can build the page URLs ourselves, but we have to force the inclusion of the last page, since page numbers here start at 1 (which, oddly, violates the usual zero-based convention).
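As a rough illustration of that approach (not the code in this PR), here is a minimal sketch that generates one URL per page from total instead of trusting page_list. The function name build_page_urls and its parameters are hypothetical; the only assumptions taken from the paging data above are that pages are selected with the sp query parameter and are numbered from 1 through total inclusive.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url: str, total_pages: int) -> List[str]:
    """Return one URL per page, numbered 1..total_pages inclusive.

    Hypothetical helper: assumes the page is selected by the `sp`
    query parameter, as in the `next` and `last` URLs above.
    """
    parts = urlsplit(base_url)
    # Keep every existing query parameter except a stale `sp`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sp"
    ]

    urls = []
    # range() excludes its stop value, so add 1 to include the last page.
    for page in range(1, total_pages + 1):
        page_query = urlencode(query + [("sp", str(page))])
        urls.append(urlunsplit(parts._replace(query=page_query)))
    return urls


if __name__ == "__main__":
    pagination = {"total": 4}  # value taken from the example above
    base = "https://www.loc.gov/collections/abdul-hamid-ii-books/?c=100&fo=json"
    for url in build_page_urls(base, pagination["total"]):
        print(url)
```

With total equal to 4, the final URL generated ends in sp=4, which matches the last value reported in the pagination block.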