There seem to be some issues with the output format. It is unclear whether output_format: json needs to be added to the catalog: the harvest produces both JSON and CSV whether or not the setting is present. I just harvested the Penn collections, for which the post harvest task takes ~31 hours to run, and when loading the data I realized that no thumbnails are present. Debugging this by running Airflow locally, I noticed that the post harvest task does nothing to the JSON file produced by the harvest task; it only changes the CSV file. My suspicion is that, because we map from the JSON file in traject, we never picked up the thumbnails that were added in the post harvest task.
Before re-running the Penn collection, this issue should also be resolved.
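One way to confirm the suspicion before committing to another ~31 hour run is to compare the two harvest outputs directly. The following is only a rough sketch: the function name, the `id` and `thumbnail` field names, and the assumption that the JSON output is a single array of record objects are all hypothetical and would need to be adjusted to the actual harvest files.

```python
import csv
import json


def records_missing_thumbnails(csv_path, json_path, id_field="id", thumb_field="thumbnail"):
    """Return ids whose thumbnail is set in the CSV but empty or absent in the JSON.

    The field names and the JSON layout are assumptions, not the harvester's
    documented output format.
    """
    # Thumbnail value per record id, as written to the CSV file.
    with open(csv_path, newline="", encoding="utf-8") as f:
        csv_thumbs = {row[id_field]: row.get(thumb_field, "") for row in csv.DictReader(f)}

    # Assumed layout: the JSON file holds one array of record objects.
    with open(json_path, encoding="utf-8") as f:
        json_records = json.load(f)
    json_thumbs = {str(rec.get(id_field)): rec.get(thumb_field) for rec in json_records}

    # Keep ids that have a thumbnail in the CSV but none in the JSON.
    return sorted(rid for rid, thumb in csv_thumbs.items() if thumb and not json_thumbs.get(rid))
```

If this returns a non-empty list for a small test collection after the post harvest task has run, that would support the idea that the thumbnails only ever reach the CSV file.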