htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Feature request: volume availability check #22

Closed: rburke2233 closed this issue 5 years ago

rburke2233 commented 6 years ago

I would like a feature for checking whether a given HathiTrust ID is part of the EF dataset. As far as I can tell, the only way to do this now is to try downloading the file via rsync and see whether it succeeds.

An API call something like:

file_available(ids)

which would return a list of True or False values, one per ID, in the same order as the input.

organisciak commented 6 years ago

Here's a solution using an endpoint we already have. I won't have time for a few days to package it and write tests.

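import pandas as pd

# Existing endpoint, queried with a comma-separated list of IDs
EF_CHECK_URL = "https://data.analytics.hathitrust.org/features/get?ids={}"

def files_available(ids):
    # Build a single request covering every ID
    url = EF_CHECK_URL.format(",".join(ids))
    # Parse the JSON response into a Series indexed by ID
    results = pd.read_json(url, orient='index', typ='series', convert_dates=False)
    # Return the values in the same order as the input IDs
    return results[ids].tolist()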

Using it:

ids = ['fakeid', 'mdp.39015033487193', 'fakeid2', 'hvd.hn3fhi', 'hvd.hxdan3']
files_available(ids)
[False, True, False, True, True]

If you use this before I add it to the library, note that it depends on pandas, imported as pd.
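
For long ID lists, a single request URL can get very large. Here is a rough sketch of a batched variant that uses only the standard library. It is not part of the library. The names files_available_batched, fetch_json, batch_size and fetch are hypothetical, and it assumes the endpoint returns a JSON object mapping each requested ID to true or false, as the pandas call above implies. The injectable fetch argument is only there so the logic can be tried without network access.

```python
import json
from urllib.request import urlopen

EF_CHECK_URL = "https://data.analytics.hathitrust.org/features/get?ids={}"


def fetch_json(url):
    """Default fetcher (hypothetical helper): return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def files_available_batched(ids, batch_size=50, fetch=fetch_json):
    """Return one bool per ID, querying the endpoint in batches.

    Assumption: each response is a JSON object of the form {id: true/false}.
    IDs that are missing from a response are reported as False.
    """
    ids = list(ids)
    found = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        url = EF_CHECK_URL.format(",".join(batch))
        found.update(json.loads(fetch(url)))
    # Preserve the order of the input IDs
    return [bool(found.get(i, False)) for i in ids]
```

Called as files_available_batched(ids), it should give the same kind of list as files_available(ids). The batch size of 50 is an arbitrary default, not a documented limit of the endpoint.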