sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

AUC abdullah-freres collection failing transform validation #386

Open jacobthill opened 1 year ago

jacobthill commented 1 year ago

Exception: ERROR: failed to transform all harvested records: harvested record count (48) != transformed record count (51)

I confirmed 48 records harvested by running harvest again locally. I don't know why the transformed record count is 51. This is strange behavior: I only expected that the transformed record count could be lower, since bad data or errors in the config might prevent some records from being transformed. This is adding 3 records and I have no idea where they would come from. I suspect traject or another airflow task might be adding blank lines?

auc_ahmed_toughan, auc_coptic, auc_maps, etc. worked
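
One way to test the "extra lines" suspicion is to count non-blank lines and parseable JSON records separately. This is only a rough sketch: `count_ndjson` and the sample data are made up for illustration and are not part of dlme-airflow or its validation task. It shows how a single record broken by a stray newline raises the line count while lowering the number of valid records.

```python
import json


def count_ndjson(lines):
    """Return (non_blank_lines, parseable_records) for an iterable of NDJSON lines.

    Hypothetical helper for illustration; not the project's validation code.
    """
    non_blank = 0
    parseable = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines entirely
        non_blank += 1
        try:
            json.loads(line)
        except json.JSONDecodeError:
            continue  # a fragment of a split record, or otherwise invalid JSON
        parseable += 1
    return non_blank, parseable


# Made-up sample: three records, the second one broken across two lines.
sample = [
    '{"id": "a"}\n',
    '{"id": "b", "title": "Luxor\n',
    ' Temple"}\n',
    '{"id": "c"}\n',
]

print(count_ndjson(sample))
```

If the two numbers differ for the real output file, the surplus is coming from fragments rather than from genuinely new records.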

edsu commented 1 year ago

We noticed that one JSON object in the Transform log was getting split over multiple lines?

{"cho_title":{"en":["Luxor Temple"]},"cho_creator":{"en":["Abdullah Frères"]},"cho_date":{"en":["1890-1899"]},"cho_date_range_hijri":[1307,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317],"cho_date_range_norm":[1890,1891,1892,1893,1894,1895,1896,1897,1898,1899],"cho_dc_rights":{"en":["To inquire about permissions or reproductions, contact the Rare Books and Special Collections Library, The American University in Cairo at +20.2.2615.3676 or rbscl-ref@aucegypt.edu."]},"cho_description":{"en":["Temple de Luxor. 109A"]},"cho_edm_type":{"en":["Image"],"ar-Arab":["صورة"]},"cho_format":{"en":["image/jpg"]},"cho_has_type":{"en":["Other Images"],"ar-Arab":["صور أخرى"]},"cho_is_part_of":{"en":["19th Century photographs"]},"cho_medium":{"en":["photographic prints"]},"cho_type":{"en":["Still Image"]},"agg_data_provider":{"en":["American University in Cairo"],"ar-Arab":["الجامعة الأمريكية في القاهرة"]},"agg_provider":{"en":["American Uni
[2023-03-16, 17:22:21 UTC] {docker.py:373} INFO - versity in Cairo
[2023-03-16, 17:22:21 UTC] {docker.py:373} INFO - "],"ar-Arab":["الجامعة الأمريكية في القاهرة"]},"agg_provider_country":{"en":["Egypt"],"ar-Arab":["مصر"]},"agg_data_provider_country":{"en":["Egypt"],"ar-Arab":["مصر"]},"cho_type_facet":{"en":["Image","Image:Other Images"],"ar-Arab":["صورة","صورة:صور أخرى"]},"id":"p15795coll38:91","transform_version":"923910e","transform_timestamp":"2023-03-16 17:22:21 +0000","agg_data_provider_collection_id":"auc-abdullah-freres","dlme_source_file":"/auc/abdullah_freres/data.csv","agg_is_shown_at":{"wr_dc_rights":["To inquire about permissions or reproductions, contact the Rare Books and Special Collections Library, The American University in Cairo at +20.2.2615.3676 or rbscl-ref@aucegypt.edu."],"wr_format":["image/jpeg"],"wr_is_referenced_by":["https://cdm15795.contentdm.oclc.org/iiif/p15795coll38:91/manifest.json"],"wr_id":"https://cdm15795.contentdm.oclc.org/iiif/2/p15795coll38:91/full/full/0/default.jpg"},"agg_preview":{"wr_dc_rights":["To inquire about permissions or reproductions, contact the Rare Books and Special Collections Library, The American University in Cairo at +20.2.2615.3676 or rbscl-ref@aucegypt.edu."],"wr_format":["image/jpeg"],"wr_is_referenced_by":["https://cdm15795.contentdm.oclc.org/iiif/p15795coll38:91/manifest.json"],"wr_id":"https://cdm15795.contentdm.oclc.org/iiif/2/p15795coll38:91/full/400,400/0/default.jpg"}}
edsu commented 1 year ago

If you look at line 13 in the transformed output at dlme-airflow-dev:/opt/app/dlme/datashare/output-auc-abdullah-freres.ndjson you can see a truncated line:

{"cho_titl

and another on line 25, and another on line 30.
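
To list lines like these without scanning the file by eye, something along the following lines might help. It is a minimal sketch: `find_bad_lines` is a hypothetical name, and it assumes every valid line in the output should be a single JSON object.

```python
import json


def find_bad_lines(lines):
    """Yield (line_number, preview) for lines that are not a complete JSON object.

    Hypothetical helper; assumes one JSON object per line (NDJSON).
    """
    for number, line in enumerate(lines, start=1):
        preview = line.rstrip("\n")[:40]
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            yield number, preview  # truncated, split, or blank line
            continue
        if not isinstance(record, dict):
            yield number, preview  # valid JSON, but not an object


# Made-up example with one truncated line in the middle.
example = ['{"id": "a"}\n', '{"cho_titl\n', '{"id": "b"}\n']
for number, preview in find_bad_lines(example):
    print(number, preview)
```

For the real file, passing an open file handle (for example `open(path, encoding="utf-8")`) in place of `example` should work, since the function only iterates over lines.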

The original CSV data for what should be on line 13 looks like a complete CSV line without line breaks:

http://iiif.io/api/presentation/2/context.json,https://cdm15795.contentdm.oclc.org/iiif/p15795coll38:64/manifest.json,image/jpeg,http://iiif.io/api/image/2/level1.json,https://cdm15795.contentdm.oclc.org/iiif/2/p15795coll38:64/full/full/0/default.jpg,['Albumin'],['Abdullah Frères'],['1890-1899'],['Le champs de bataille à Toski. 84A'],['image/jpg'],['19th Century photographs'],['26.7 x 20.7'],"['To inquire about permissions or reproductions, contact the Rare Books and Special Collections Library, The American University in Cairo at +20.2.2615.3676 or rbscl-ref@aucegypt.edu.']",['photographic prints'],['General views; landscape; Toshka'],['War field in Toski'],['Still Image'],['Rare Books and Special Collections Digital Library '],"['<span>From: <a href=""http://digitalcollections.aucegypt.edu/digital/collection/p15795coll38/id/64"">War field in Toski</a></span>']"
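
To check the "no line breaks" observation across the whole CSV rather than one row, the physical line count can be compared with the number of records the `csv` module parses. This is a sketch under assumptions: `csv_line_stats` is a made-up helper and the sample text is invented, not taken from the harvested data.

```python
import csv
import io


def csv_line_stats(text):
    """Return (physical_lines, csv_records) for CSV text.

    Hypothetical helper. If the two numbers differ, at least one field
    contains a line break (or another character that str.splitlines
    treats as a line boundary).
    """
    physical = len(text.splitlines())
    # newline="" leaves line endings untouched so csv.reader can handle
    # line breaks inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    records = sum(1 for _ in reader)
    return physical, records


clean = "id,description\n1,Temple de Luxor\n"
embedded = 'id,description\n1,"Temple de\nLuxor"\n'

print(csv_line_stats(clean))
print(csv_line_stats(embedded))
```

Matching counts for the harvested CSV would support the idea that the input is fine and the split is introduced later in the pipeline.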

I think this points to some kind of problem in traject / dlme-transform? Maybe it's a problem that we're only noticing now that the transform validation is running again?