sul-dlss / dlme-harvest

DLME Scripts for harvesting data from providers

[SPIKE] Investigate bonobo for improving harvesting #89

Closed aaron-collier closed 3 years ago

jermnelson commented 3 years ago

Bonobo Analysis

Bonobo is a pure Python ETL module that is based on a directed graph for managing extraction, transformation, and loading steps in a pipeline.
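
To make the graph idea concrete, here is a minimal plain-Python sketch of a linear chain in which each step yields zero or more values that are fed to the next step. It only illustrates the concept and is not Bonobo's implementation; `run_chain`, `extract`, `transform`, and `load` are hypothetical names, and the rows are hard-coded sample data.

```python
from typing import Callable, Iterable, Iterator, List

# Stand-in for a file or database that the load step writes to
loaded: List[str] = []


def extract() -> Iterator[str]:
    """Extraction step: produce raw rows (hard-coded here for illustration)."""
    yield from ["alpha", "beta", "gamma"]


def transform(row: str) -> Iterator[str]:
    """Transformation step: yield one upper-cased row per input row."""
    yield row.upper()


def load(row: str) -> Iterator[str]:
    """Load step: store the row, then pass it along unchanged."""
    loaded.append(row)
    yield row


def run_chain(source: Callable[[], Iterable], *steps: Callable) -> List:
    """Feed every value yielded by one step into the next step, in order."""
    rows = list(source())
    for step in steps:
        rows = [out for row in rows for out in step(row)]
    return rows


result = run_chain(extract, transform, load)
```

Because every step is a plain generator function, any one of them can be swapped out without touching the others, which is the property the graph-based approach relies on.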

Pros

Cons

To understand how we could refactor dlme-harvest into these graphs, I created an example Bonobo graph for the IIIF workflow in Python pseudo-code:

import functools
from typing import Iterator

import bonobo
import pandas as pd
import requests

# Extract function for IIIF collection
def get_iiif_collection(collection_url: str) -> Iterator[list]:
    collection_response = requests.get(collection_url)
    # Yield the list of manifests; Bonobo passes each yielded value to the next node
    yield collection_response.json().get('manifests', [])

# Record manifest and harvest data
def record_manifest_harvest(manifests: list, record_ids: list = None) -> Iterator[str]:
    record_ids = [] if record_ids is None else record_ids
    for manifest in manifests:
        if manifest['@id'] in record_ids:
            print(f"Duplicate record: {manifest['@id']}")
            continue
        record_ids.append(manifest['@id'])
        # Yield each new manifest URL for the transformation step
        yield manifest['@id']

# Transformation IIIF function
def transform_iiif_record_data(record_manifest_url: str) -> Iterator[dict]:
    # Return the available fields in the record manifest
    data = {}
    manifest_response = requests.get(record_manifest_url)
    record_data = manifest_response.json()
    data['rendering'] = record_data['rendering']['@id']
    data['thumbnail'] = record_data['thumbnail']['@id']
    if 'description' in record_data:
        data['description_top'] = record_data['description']
    for i in record_data['metadata']:
        data[i['label'].lower().replace(" ", "_")] = i['value']
    yield data

# Load (save) IIIF to JSON output
def save_iiif_json(filename: str, data: dict) -> None:
    # Wrap the record in a list: pandas cannot build a DataFrame from a dict of scalars alone
    iiif_dataframe = pd.DataFrame([data])
    # Append one JSON line per record (mode='a' needs pandas 2.0 or later)
    iiif_dataframe.to_json(filename, orient='records', lines=True, mode='a')

def get_iiif_graph(**options):
    # Assumption: the caller supplies 'collection_url' and 'filename' as options
    graph = bonobo.Graph()
    graph.add_chain(functools.partial(get_iiif_collection, options['collection_url']),
                    record_manifest_harvest,
                    transform_iiif_record_data,
                    functools.partial(save_iiif_json, options['filename']))
    return graph

For custom IIIF processing, the function transform_iiif_record_data could become a decorator on specific sources while retaining the other steps in the IIIF graph. This refactoring may still be a good candidate for the harvest scripts even if we don't use Bonobo as the eventual ETL framework, for example as part of setting up Apache Airflow directed graphs.
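
One way the decorator idea could look, sketched in plain Python without Bonobo or network access. The names `custom_transform`, `add_source_label`, `transform_record`, and the `"example-provider"` label are all hypothetical, and the base transform takes an already-parsed manifest dict instead of fetching a URL, purely to keep the example self-contained.

```python
from functools import wraps
from typing import Callable, Iterator


def transform_record(record_data: dict) -> Iterator[dict]:
    """Generic transform: flatten IIIF-style metadata into a flat dict."""
    data = {}
    for item in record_data.get("metadata", []):
        data[item["label"].lower().replace(" ", "_")] = item["value"]
    yield data


def custom_transform(post_process: Callable[[dict], dict]) -> Callable:
    """Decorator factory: apply source-specific post-processing to each yielded record."""
    def decorator(transform: Callable[..., Iterator[dict]]) -> Callable[..., Iterator[dict]]:
        @wraps(transform)
        def wrapper(*args, **kwargs) -> Iterator[dict]:
            for data in transform(*args, **kwargs):
                yield post_process(data)
        return wrapper
    return decorator


def add_source_label(data: dict) -> dict:
    """Hypothetical provider-specific tweak: tag each record with its source."""
    return {**data, "source": "example-provider"}


# Source-specific variant that could replace the generic step in the chain
example_transform = custom_transform(add_source_label)(transform_record)

manifest = {"metadata": [{"label": "Date Created", "value": "1900"}]}
records = list(example_transform(manifest))
```

The wrapped function is still a generator that yields one dict per record, so it can take the place of the generic transformation step while the extract and load steps stay unchanged.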

aaron-collier commented 3 years ago

Moving to analysis doc and closing.