Bonobo is a pure Python ETL module that is based on a directed graph for managing extraction, transformation, and loading steps in a pipeline.
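To make the directed-graph idea concrete, here is a minimal plain-Python sketch of a linear chain, where every row yielded by one step is fed to the next. It does not use Bonobo at all; `run_chain`, `_apply`, and the three toy steps are hypothetical names chosen for illustration, and a real Bonobo graph adds scheduling and branching that this sketch ignores.

```python
from typing import Callable, Iterable, Iterator, List


def _apply(step: Callable[[object], Iterator], rows: Iterable) -> Iterator:
    """Call one step on every incoming row and pass along whatever it yields."""
    for row in rows:
        yield from step(row)


def run_chain(source: Iterable, *steps: Callable[[object], Iterator]) -> List:
    """Feed the rows from `source` through each step in order (hypothetical helper)."""
    rows: Iterable = source
    for step in steps:
        rows = _apply(step, rows)
    return list(rows)


# Toy extract / transform steps, for illustration only
def extract() -> Iterator[str]:
    yield from ['a b', 'c d']


def split_words(line: str) -> Iterator[str]:
    yield from line.split()


def upper(word: str) -> Iterator[str]:
    yield word.upper()


result = run_chain(extract(), split_words, upper)
```

Each step only needs to know about the row it receives, which is what makes individual steps easy to swap or reuse across pipelines.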
Pros
Low overhead
Good (not great) documentation
May be appropriate as a lightweight ETL library for DLME
Cons
Project has had limited development activity in the past year
We will likely need to extend the base Bonobo Node reader to handle the Feed XML and OAI-PMH
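As a rough idea of what an OAI-PMH extract step might involve, the sketch below parses a ListRecords response with the standard library and yields one dictionary per record. It is a plain generator, not a subclass of any Bonobo reader; the function name `extract_oai_records`, the output keys, and the sample XML are assumptions made for illustration, and a real harvester would also need to fetch pages and follow resumption tokens.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator

# Standard OAI-PMH 2.0 and Dublin Core namespace URIs
NAMESPACES = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}


def extract_oai_records(xml_text: str) -> Iterator[Dict[str, object]]:
    """Yield the identifier and titles of each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    for record in root.iterfind('.//oai:record', NAMESPACES):
        identifier = record.findtext('oai:header/oai:identifier',
                                     namespaces=NAMESPACES)
        titles = [el.text for el in record.iterfind('.//dc:title', NAMESPACES)]
        yield {'identifier': identifier, 'titles': titles}


# Hypothetical sample response, trimmed to a single record
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

records = list(extract_oai_records(SAMPLE))
```

Because the function is an ordinary generator, it could be tried out on saved responses before deciding how it should plug into Bonobo or another framework.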
Visualizations
To understand how we could refactor dlme-harvest into these graphs, I created an example Bonobo graph for the IIIF workflow in Python pseudo-code:
from typing import Iterator

import bonobo
import pandas as pd
import requests


# Extract function for IIIF collection
def get_iiif_collection(collection_url: str) -> Iterator[dict]:
    collection_response = requests.get(collection_url)
    # Default to an empty list when the collection has no 'manifests' key
    yield from record_manifest_harvest(collection_response.json().get('manifests', []),
                                       record_ids=[])


# Record manifest and harvest data
def record_manifest_harvest(manifests: list, record_ids: list) -> Iterator[dict]:
    for manifest in manifests:
        # Skip manifests that have already been harvested
        if manifest['@id'] in record_ids:
            print(f"Duplicate record: {manifest['@id']}")
            continue
        record_ids.append(manifest['@id'])
        yield from transform_iiif_record_data(manifest['@id'])


# Transformation IIIF function
def transform_iiif_record_data(record_manifest_url: str) -> Iterator[dict]:
    # Return the available fields in the record manifest
    data = {}
    manifest_response = requests.get(record_manifest_url)
    record_data = manifest_response.json()
    data['rendering'] = record_data['rendering']['@id']
    data['thumbnail'] = record_data['thumbnail']['@id']
    if 'description' in record_data:
        data['description_top'] = record_data['description']
    for i in record_data['metadata']:
        # Normalize labels such as "Date Created" to "date_created"
        data[i['label'].lower().replace(" ", "_")] = i['value']
    yield data


# Load (save) IIIF to JSON output
def save_iiif_json(filename: str, data: dict) -> None:
    # Wrap the record in a list so pandas builds a one-row frame
    iiif_dataframe = pd.DataFrame([data])
    iiif_dataframe.to_json(filename)


def get_iiif_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(get_iiif_collection,
                    record_manifest_harvest,
                    transform_iiif_record_data,
                    save_iiif_json)
    return graph
For custom IIIF processing, the function transform_iiif_record_data could become a decorator on specific sources while retaining the other steps in the iiif_graph. This refactoring may still be a good candidate for the harvest scripts even if we don't use Bonobo as the eventual ETL framework, for example as part of setting up Apache Airflow directed graphs.
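One way the decorator idea could look is sketched below. The shared field handling lives in a hypothetical `iiif_transform` decorator, and each source supplies only its own customization. To keep the example free of network calls, the wrapped function receives already-parsed manifest JSON rather than a URL; that, along with the names `iiif_transform` and `example_source` and the `source` field, is an assumption for illustration and not part of the existing harvest scripts.

```python
from functools import wraps
from typing import Callable, Dict, Iterator

Record = Dict[str, object]


def iiif_transform(source_step: Callable[[Record], Record]) -> Callable[[Record], Iterator[Record]]:
    """Wrap a source-specific step with the shared IIIF field handling (hypothetical)."""

    @wraps(source_step)
    def wrapper(record_data: Record) -> Iterator[Record]:
        data: Record = {}
        if 'description' in record_data:
            data['description_top'] = record_data['description']
        for entry in record_data.get('metadata', []):
            # Normalize labels such as "Date Created" to "date_created"
            data[entry['label'].lower().replace(' ', '_')] = entry['value']
        # The source-specific customization runs last
        yield source_step(data)

    return wrapper


@iiif_transform
def example_source(data: Record) -> Record:
    # Hypothetical per-source rule: tag every record with its origin
    data['source'] = 'example'
    return data


rows = list(example_source({
    'description': 'A map',
    'metadata': [{'label': 'Date Created', 'value': '1850'}],
}))
```

The decorated function still behaves like a generator step, so it could sit in the same position in the chain as the generic transform.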
Bonobo Analysis