sul-dlss / dlme-transform

Transforms raw DLME metadata to DLME intermediate representation
Apache License 2.0

update extract_json #1020

Closed jacobthill closed 6 months ago

jacobthill commented 1 year ago

Currently, to get all values from a JSON field, we have to make a separate extract_json call for each index. I need 31 calls on one field, and there is no way to be sure we are getting all of the data without building an Airflow task that checks the maximum length of each field coming out of the harvest task against the number of calls we make in the traject config. A better approach would be to update extract_json so that it expects a list and plays nicely with the macros we call afterwards, e.g. strip.

extract_json also cannot accept integers.

Exception: NoMethodError: undefined method `empty?' for 1:Integer /opt/traject/lib/macros/each_record.rb:35:in `reject'
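
The failure can be reproduced outside traject: Integer does not define empty?, so a reject(&:empty?) filter raises as soon as a numeric JSON value passes through it. Below is a minimal sketch of one possible guard, assuming the downstream macros only need string values. drop_blank_values is a hypothetical helper for illustration, not a function in dlme-transform.

```ruby
# Hypothetical helper (not part of dlme-transform): convert every extracted
# value to a String before filtering, so Integers no longer break `empty?`.
def drop_blank_values(values)
  values.compact.map(&:to_s).reject(&:empty?)
end

# Integer has no `empty?` method, which is what the exception reports.
begin
  [1, 'a', ''].reject(&:empty?)
rescue NoMethodError => e
  puts e.message
end

# nil and empty strings are dropped; the Integer survives as a String.
p drop_blank_values([1, 'a', '', nil])
```

Whether integers should be stringified or kept as-is depends on what the later macros expect; this sketch assumes strings.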

For example, to_field 'cho_contributor', extract_json('.contributor'), strip, unique, arabic_script_lang_or_default('und-Arab', 'en') should get all values from the contributor field.

Here is a JSON record with many values in the contributor field to test on:

[{
  "id": "https:\/\/figgy.princeton.edu\/concern\/scanned_resources\/d917e0d6-2894-4d1e-939b-93fdc0796c96\/manifest",
  "description": ["Ms. codex.", "Title supplied by cataloger.", "The album consists of 55 folios measuring 485 x 280 mm.", "Origin: The signed paintings and calligraphies are attributed to Mīr ʻAlī, Sulṭān ʻAlī al-Mashhadī, Muḥammad ʻAẓīm al-Ikthīr, ʻImād al-Ḥusaynī, Muḥammad Muʻīn ʻAlī Tajallī (= Shāh Muḥammad Muʻīn ʻAlī Tajallī Chishtī, fol. 32a), Muḥammad Ibrāhīm, ʻAbd Allāh, Muḥammad ʻAlī, Muḥammad Badīʻ al-Iṣfahānī, Tajallī ʻAlī Shāh, Muḥammad Karīm, Anūp Chator (Chatar, Chitor: see Titley), Mīr Muḥammad Māh Ḥusaynī, Zayn al-Ḥaqq, Zarrīn Raqam, Muḥammad Afḍal, Ghulām Muṣṭafá Khān, Muḥammad Aṭhar, Asad Allāh, Mughalkhān, Ghulām Jamāl Allāh Khān, Sayyid ʻAlī Būkhārī Rūshan Raqam, Jawāhir Raqam-i Thānī, Muḥammad Ḥusayn, Muḥammad ʻAlī Gawhar, Muḥammad al-Fakhkhār, Abū al-Baqāʼ al-Mūsawī, Kifāyat Khān, Ismāʻīl, Ilyās Bahādur, Aḥmad Shāh, Muḥammad Dalīr, and date from 1014H. [1630 or 31] to 1189 [1775]. Other undated paintings can be dated to 18th-19th century India and 16th or early 17th century Central Asia. One piece is dated Awrangābād, [1]203 [1788 or 9] (fol. 22a)."],
  "thumbnail": ["https:\/\/iiif-cloud.princeton.edu\/iiif\/2\/5e%2F26%2F9e%2F5e269edd605a48f6baeccb446e35da81%2Fintermediate_file\/full\/!200,150\/0\/default.jpg", "https:\/\/iiif-cloud.princeton.edu\/iiif\/2\/5e%2F26%2F9e%2F5e269edd605a48f6baeccb446e35da81%2Fintermediate_file"],
  "artist": ["Anup Chittaur, 17th cent", "Ilyās Bahādur, 17th cent"],
  "calligrapher": ["Mashhadī, Sulṭān ʻAlī.", "Ḥusaynī, ʻImād, active 17th century", "Mīr ʻAlī, 17th cent", "Muḥammad Muʻīn ʻAlī, 18th cent", "Muḥammad ʻAlī, fl. 1719", "Muḥammad Ibrāhīm, 17th cent", "Iṣfahānī, Muḥammad Badīʻ, fl. 1710", "Tajallī ʻAlī Shāh, active 1775", "Ḥusaynī, Muḥammad ʻĀbid", "Muḥammad Karīm, fl. 1672", "Zayn al-Ḥaqq, active 18th century", "Zarrīn Raqam, active 18th century", "Asad Allāh, active 18th century", "Muḥammad Afḍal, 18th cent", "Muṣṭafá Khān.", "ʻAbd Allāh, active 17th century", "Aṭhar, Muḥammad", "Mughalkhān.", "Jamāl Allāh Khān.", "Dalīr, Muḥammad, 18th cent", "Rūshan Raqam, 18th cent", "Javāhir Raqam-i Thānī.", "Muḥammad ʻAlī Gawhar, 18th cent", "Fakhkhār, Muḥammad", "Mūsawī, Abū al-Baqāʼ, 17th cent", "Kifāyat Khān, 18th cent", "Ismāʻīl, active 18th century", "Aḥmad Shāh, active 18th century", "Iksīr, Muḥammad ʻAẓīm, active 18th century"],
  "former-owner": ["Malcolm, John, 1769-1833"],
  "abstract": ["Album of miniatures and specimens of calligraphy of Indian origin. Described by Mika Natif."],
  "extent": ["55 leaves : paper, col. ill. ; 485 x 280 mm."],
  "identifier": ["<a href='http:\/\/arks.princeton.edu\/ark:\/88435\/bg257f113' alt='Identifier'>http:\/\/arks.princeton.edu\/ark:\/88435\/bg257f113<\/a>"],
  "replaces": ["pudl0032\/102g"],
  "title": ["[Album of miniatures and specimens of calligraphy of Indian origin]."],
  "type": ["Composite portraits", "Portraits", "Art", "Illustrations", "Pictorial works"],
  "provenance": ["A letter attached to the album mentions that this was made for the Portuguese governor of India, but this is a late attribution. The document further indicates that the book belonged to the Delaney family who sold it to Sir John Malcolm (1769-1833) from the East India Company. It was then sold to Rowland Jones Esq. (1772-1856), Broom Hall, Carnarvonshire, Wales. The album was then auctioned in an estate sale between Feb. 24 and March 8, 1857 at Carnarvonshire and was purchased by William Stewart Esq. (1798-1874) of Aldenham Abbey, Hertfordshire. Stewart sold the book at Christie's auction in London in 1875. There was probably another owner after Stewart from whom Robert Garrett (1875-1961) purchased the album."],
  "contributor": ["مشهدي، سلطان علي", "حسيني، عماد", "مير علي", "محمد معين علي", "محمد علي", "محمد ابراهيم", "اصفهاني، محمد بديع", "تجلي علي شاه", "حسيني، محمد عابد", "محمد كريم", "انوپ چتر", "زين الحق", "زرين رقم", "اسد الله", "محمد افضل", "مصطفى خان", "عبد الله", "اطهر، محمد", "مغلخان", "جمال الله خان", "دلير، محمد", "روشن رقم", "جواهر رقم ثاني", "محمد علي گوهر", "فخار، محمد", "موسوي، ابو البقاء", "كفايت خان", "اسماعيل", "الياس بهادر", "احمد شاه", "اكسير، محمد عظيم"],
  "date": ["1600-1899"],
  "language": ["Persian"],
  "local-identifier": ["pmc87qv19t"],
  "publisher": ["[between 16-- and 18--]"],
  "subject": ["Manuscripts, Persian—New Jersey—Princeton"],
  "source-acquisition": ["Gift ; Robert Garrett, Class of 1897 ; 1942."],
  "call-number": ["Islamic Manuscripts, Garrett no. 102G", "Electronic Resource"],
  "location": ["HSVM Islamic Manuscripts, Garrett no. 102G", "HSVM Electronic Resource", "ELF1 Islamic Manuscripts, Garrett no. 102G", "ELF1 Electronic Resource"],
  "electronic-locations": [null],
  "embargo-date": [""],
  "member-of-collections": ["Robert Garrett", "Princeton Digital Library of Islamic Manuscripts", "Collections Donated to Princeton University Library", "Capturing Feathers", "Shahnamah"],
  "author": null,
  "contents": null,
  "uniform-title": null,
  "content-type": null,
  "creator": null,
  "text-language": null,
  "binding-note": null,
  "collector": null,
  "alternative": null
}]
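
As a rough illustration of the requested behavior, the sketch below pulls every value out of a field in one call and then applies the equivalent of strip and unique. It uses plain Ruby on a trimmed copy of the record above rather than the real traject macros. extract_all is a hypothetical stand-in, and it assumes the field name is a top-level key, which may not match how extract_json resolves its path argument.

```ruby
require 'json'

# Trimmed copy of the record above (three of the 31 contributor values).
sample = <<~JSON
  [{
    "contributor": ["مشهدي، سلطان علي", "حسيني، عماد", "مير علي"],
    "electronic-locations": [null],
    "embargo-date": [""],
    "author": null
  }]
JSON

# Hypothetical stand-in for a list-aware extract_json: returns every value
# of `field` as an Array of stripped, de-duplicated Strings, regardless of
# how long the source list is.
def extract_all(record, field)
  Array(record[field]).compact.map { |v| v.to_s.strip }.reject(&:empty?).uniq
end

record = JSON.parse(sample).first
p extract_all(record, 'contributor').length   # all three values in one call
p extract_all(record, 'electronic-locations') # [null] yields an empty list
p extract_all(record, 'embargo-date')         # [""] yields an empty list
p extract_all(record, 'author')               # null yields an empty list
```

With this shape, the number of values no longer has to be known in advance, so no per-index calls or length checks are needed.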

Requirements:

jacobthill commented 6 months ago

Note: I found https://github.com/sul-dlss/dlme-transform/blob/b434ee43245c14ab5140a1b162ee7d2ae14fc55d/lib/macros/field_extraction.rb#L27, which I think we used in the past as a workaround. It is not a great solution, though, because we refresh data automatically, so a provider could change a list into a string (or vice versa) at any time.
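
One way to make extraction indifferent to that kind of shape change is to normalize every field value to a list before any other macro sees it. The following is only a sketch: as_list is a hypothetical name, and it uses an explicit case statement rather than Kernel#Array because Array() would turn a JSON object (a Ruby Hash) into a list of key/value pairs.

```ruby
# Hypothetical normalizer: always return an Array, whether the provider
# sent a single scalar, a list, or null for the field.
def as_list(value)
  case value
  when nil then []
  when Array then value
  else [value] # String, Integer, Hash, ...: wrap as a single element
  end
end

p as_list('Persian')            # a bare string becomes a one-element list
p as_list(['Persian', 'Arabic']) # a list is returned unchanged
p as_list(nil)                  # null becomes an empty list
p as_list({ 'a' => 1 })         # an object is wrapped, not split into pairs
```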