internetarchive / iiif

The official Internet Archive IIIF service
GNU General Public License v3.0

Migrate iiif.archive.org/iiif off of Labs APIs #66

Open mekarpeles opened 3 months ago

mekarpeles commented 3 months ago

Currently, requests to https://iiif.archive.org/iiif still call https://api.archivelab.org/iiif. The Labs APIs are likely to go away around this year, so we should move the dependent code out of api.archivelab.org/iiif and into the iiif service itself.

Here we can see where and how iiif.archive.org calls api.archivelab.org/iiif to generate a searchable JSON list of items that can be accessed in IIIF format:

https://github.com/internetarchive/iiif/blob/main/iiify/resolver.py#L32-L38

The corresponding code on api.archivelab.org/iiif is the Catalog class: https://github.com/ArchiveLabs/api.archivelab.org/blob/master/server/views/apis/v1/iiif.py#L21-L34

    def get(self, page=1, limit=1000):
        q = request.args.get('q', '')
        query = "(mediatype:(texts) OR mediatype:(image))" + \
                ((" AND %s" % q) if q else "")
        fields = request.args.get('fields', '')
        sorts = request.args.get('sorts', '')
        cursor = request.args.get('cursor', '')
        version = 'v2'
        limit = 1000
        return items(page=page, limit=limit, fields=fields, sorts=sorts,
                     query=query, cursor=cursor, version=version)

which calls items: https://github.com/ArchiveLabs/api.archivelab.org/blob/master/server/api/archive.py#L303-L314

def items(iid=None, query="", page=1, limit=100, fields="", sorts="",
          cursor=None, version=''):
    # aaron's idea: Weekly dump of ID of all identifiers (gzip)
    # elastic search query w/ paging
    if iid:
        return item(iid)
    # 'all:1' also works
    q = "NOT identifier:..*" + (" AND (%s)" % query if query else "")
    if version == 'v2':
        return scrape(query=q, fields=fields, sorts=sorts, count=limit,
                      cursor=cursor)
    return search(q, page=page, limit=limit)
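
To make the combined effect of Catalog.get and items easier to see, here is a minimal sketch that traces how the final query string is assembled. The helper names (catalog_query, items_query, final_query) are hypothetical and exist only for illustration; the string logic is copied from the two functions quoted above.

```python
# Hypothetical helpers that mirror the string building in Catalog.get and items.
MEDIATYPE_FILTER = "(mediatype:(texts) OR mediatype:(image))"


def catalog_query(q=""):
    """Mirror Catalog.get: restrict to texts/images, then append the user's q."""
    return MEDIATYPE_FILTER + ((" AND %s" % q) if q else "")


def items_query(query=""):
    """Mirror items: prefix the identifier exclusion and wrap the rest in parens."""
    return "NOT identifier:..*" + (" AND (%s)" % query if query else "")


def final_query(q=""):
    """The query string that ends up being passed to scrape for version 'v2'."""
    return items_query(catalog_query(q))


if __name__ == "__main__":
    print(final_query())
    print(final_query("collection:foo"))
```

Note that the user-supplied q is appended without its own parentheses, so a q containing OR may group differently than the caller intended. That is worth keeping in mind when the code is moved.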

items in turn calls one of item, scrape, or search:

def item(iid):
    try:
        return requests.get('%s/metadata/%s' % (API_BASEURL, iid)).json()
    except ValueError as v:
        return v

def scrape(query, fields="", sorts="", count=100, cursor="", security=True):
    """
    params:
        query: the query (using the same Lucene-like query syntax supported by Internet Archive Advanced Search)
        fields: Metadata fields to return, comma delimited
        sorts: Fields to sort on, comma delimited (if identifier is specified, it must be last)
        count: Number of results to return (minimum of 100)
        cursor: A cursor, if any (otherwise, search starts at the beginning)
    """
    if not query:
        raise ValueError("GET 'query' parameters required")

    if int(count) > 1000 and security:
        raise MaxLimitException("Limit may not exceed 1000.")

    #sorts = sorts or 'date+asc,createdate'
    fields = fields or 'identifier,title'

    params = {
        'q': query
    }
    if sorts:
        params['sorts'] = sorts
    if fields:
        params['fields'] = fields
    if count:
        params['count'] = count
    if cursor:
        params['cursor'] = cursor

    r = requests.get(SCRAPE_API, params=params)
    return r.json()

def search(query, page=1, limit=100, security=True, sort=None, fields=None):
    if not query:
        raise ValueError("GET query parameters 'q' required")

    if int(limit) > 1000 and security:
        raise MaxLimitException("Limit may not exceed 1000.")

    sort = sort or 'sort%5B%5D=date+asc&sort%5B%5D=createdate'
    fields = fields or 'identifier,title'
    return requests.get(
        ADVANCED_SEARCH + sort,
        params={'q': query,
                'rows': limit,
                'page': page,
                'fl[]': fields,
                'output': 'json',
            }).json()
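
Since scrape accepts a cursor, a caller that wants more than one page has to feed the cursor from each response into the next request. A rough sketch of that loop follows. It assumes the scrape response is a JSON object with an "items" list and a "cursor" key that is present only while more results remain; that shape is an assumption here and should be checked against the live API. iter_scrape and fetch_page are hypothetical names, and fetch_page stands in for whatever performs the HTTP request.

```python
def iter_scrape(fetch_page, query, fields="identifier,title", count=1000,
                max_pages=None):
    """Yield documents page by page, following the cursor until it runs out.

    fetch_page is any callable that takes a params dict and returns the
    decoded JSON response (assumed shape: {"items": [...], "cursor": "..."}).
    """
    cursor = None
    pages = 0
    while True:
        params = {"q": query, "fields": fields, "count": count}
        if cursor:
            params["cursor"] = cursor
        data = fetch_page(params)
        for doc in data.get("items", []):
            yield doc
        cursor = data.get("cursor")
        pages += 1
        # Stop when the server returns no cursor or the page cap is reached.
        if not cursor or (max_pages is not None and pages >= max_pages):
            break
```

Passing the fetch function in keeps the paging logic testable without network access; in the service it could be a thin wrapper around requests.get(SCRAPE_API, params=params).json().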

mekarpeles commented 3 months ago

TL;DR --

Change https://github.com/internetarchive/iiif/blob/main/iiify/resolver.py#L32-L38 so that, instead of calling the external api.archivelab.org API, the equivalent code (https://github.com/ArchiveLabs/api.archivelab.org/blob/master/server/views/apis/v1/iiif.py#L21-L34) lives here in the iiif service.
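
As a starting point, here is a minimal sketch of what the in-service replacement might look like if the v2 (scrape) path is all that needs to be ported. Everything named here is an assumption: the SCRAPE_API URL is taken to be the public archive.org scraping endpoint, and build_scrape_url and fetch_catalog are hypothetical names, not existing functions in iiify. It uses only the standard library so it can be tried without extra dependencies; the real change would more likely reuse requests as resolver.py does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; confirm before relying on it.
SCRAPE_API = "https://archive.org/services/search/v1/scrape"
MEDIATYPE_FILTER = "(mediatype:(texts) OR mediatype:(image))"


def build_scrape_url(q="", fields="", sorts="", cursor="", count=1000):
    """Build the scrape URL that Catalog.get + items + scrape would produce."""
    if int(count) > 1000:
        raise ValueError("Limit may not exceed 1000.")
    inner = MEDIATYPE_FILTER + ((" AND %s" % q) if q else "")
    params = {
        "q": "NOT identifier:..* AND (%s)" % inner,
        "fields": fields or "identifier,title",
        "count": count,
    }
    # Optional parameters are only sent when set, as in the Labs code.
    if sorts:
        params["sorts"] = sorts
    if cursor:
        params["cursor"] = cursor
    return SCRAPE_API + "?" + urlencode(params)


def fetch_catalog(q="", fields="", sorts="", cursor="", count=1000,
                  opener=urlopen):
    """Fetch one page of the catalog; opener is injectable for testing."""
    url = build_scrape_url(q=q, fields=fields, sorts=sorts, cursor=cursor,
                           count=count)
    with opener(url) as resp:
        return json.load(resp)
```

The page-based search fallback is left out on purpose, since Catalog.get always passes version='v2' and therefore never reaches it.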