sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Resolve inconsistencies between csv and json output #568

Closed: jacobthill closed this issue 2 weeks ago

jacobthill commented 2 weeks ago

There seem to be some issues related to the output format. It is unclear whether output_format: json needs to be added to the catalog; the harvest appears to produce both json and csv whether or not it is present. I just harvested the Penn collections, whose post harvest task takes ~31 hours to run, and when loading the data I realized that no thumbnails are present. Debugging this by running airflow locally, I noticed that the post harvest task does nothing to the json file produced by the harvest task; it only changes the csv file. My suspicion is that, since we map from the json file in traject, we never picked up the thumbnails that were added in the post harvest task.

Before re-running the Penn collection, this issue should also be resolved.
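
One way to confirm the mismatch described above before committing to another long run is to compare the fields that carry data in the two output files. The following is only a rough sketch, not code from this repository: the function name `fields_missing_from_json`, the file paths, and the assumption that the json output is written as one record per line (JSON Lines) are all hypothetical and would need adjusting to the actual harvest output.

```python
import csv
import json


def fields_missing_from_json(csv_path, json_path):
    """Return CSV columns that hold data but never hold data in the JSON records.

    Hypothetical helper: assumes the JSON file has one JSON object per line.
    """
    # Collect every CSV column that has a non-empty value in at least one row.
    csv_fields = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            csv_fields.update(k for k, v in row.items() if k and v)

    # Collect every JSON key that has a non-empty value in at least one record.
    json_fields = set()
    with open(json_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            json_fields.update(
                k for k, v in record.items() if v not in (None, "", [])
            )

    return sorted(csv_fields - json_fields)
```

If the post harvest task only updates the csv file, a column such as the thumbnail field (whatever it is actually named in the output) should show up in the returned list, for example via `fields_missing_from_json("output/data.csv", "output/data.json")` with the paths replaced by the real working directory.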