Penn Museum has 3 collections. They provide all museum data in one large CSV file, which we filter for each collection to get only the records for that collection. The CSV file does not contain thumbnails, so we run a post_harvest task that fetches the thumbnail from the schema.org data. We then delete all records with no thumbnail. The Babylonian and Egyptian collections work as expected, but the Near Eastern collection does not. The log for the post_harvest task shows that it queried all URLs for thumbnails, and opening many of those URLs in a browser shows that thumbnails do exist. But the file that gets saved to /opt/app/dlme/dlme-airflow/shared/source_data/penn_museum/near_eastern is an empty JSON file. This causes the rest of the DAG to succeed with no changes. I'm not sure how to debug this, but we need to be careful not to hit Penn's site more than necessary.
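One way to narrow this down without re-querying Penn on every attempt might be to cache the thumbnail lookups on disk and count records at each stage. The following is only a minimal sketch, not the actual dlme-airflow code: the `url` and `thumbnail` field names, the injected `fetch` function, the cache file, and the `limit` parameter are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical record shape: a dict with at least a "url" key.
Record = Dict[str, Optional[str]]


def load_cache(cache_path: Path) -> Dict[str, Optional[str]]:
    """Read previously fetched thumbnails from disk, if the cache file exists."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    return {}


def add_thumbnails(
    records: List[Record],
    fetch: Callable[[str], Optional[str]],
    cache_path: Path,
    limit: Optional[int] = None,
) -> List[Record]:
    """Attach a thumbnail to each record, calling fetch() only on a cache miss."""
    cache = load_cache(cache_path)
    enriched = []
    # limit=None means "all records"; a small limit keeps a debug run cheap.
    for record in records[:limit]:
        url = record["url"]
        if url not in cache:
            cache[url] = fetch(url)  # the only place the remote site is contacted
        enriched.append({**record, "thumbnail": cache[url]})
    cache_path.write_text(json.dumps(cache))
    return enriched


def drop_missing_thumbnails(records: List[Record]) -> List[Record]:
    """Keep only records whose thumbnail is neither None nor an empty string."""
    return [record for record in records if record.get("thumbnail")]


def stage_counts(
    records: List[Record],
    fetch: Callable[[str], Optional[str]],
    cache_path: Path,
    limit: Optional[int] = None,
) -> Dict[str, int]:
    """Report how many records survive each stage, to show where they vanish."""
    enriched = add_thumbnails(records, fetch, cache_path, limit)
    kept = drop_missing_thumbnails(enriched)
    return {
        "input": len(records),
        "checked": len(enriched),
        "with_thumbnail": len(kept),
    }
```

If `input` is already 0 for Near Eastern, the problem would be in the CSV filter rather than in the thumbnail step. If `checked` is non-zero but `with_thumbnail` is 0, the thumbnail extraction is the more likely culprit. Running with a small `limit` first keeps the number of requests to Penn low, and repeat runs reuse the cache.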