scientist-softserv / adventist_knapsack

Apache License 2.0

OAI Import question for adl:periodical set: #579

Closed jeremyf closed 1 year ago

jeremyf commented 1 year ago

I ran an import on adl:periodical with a limit of 100 works. No works imported, and the importer just created a strangely-named and empty collection. What caused this behavior? I'm also asking Eric what he knows about the periodical set. I hope he can tell me if there are works in that set that failed to import.

From https://docs.google.com/document/d/1mIOT23UAilSO77pAlXYSWJEHw3YK3BNNQVTKNzd41ao/edit#

Testing Criteria

Note: the counts on the parser will be off. See this PR for reasons.

I have made efforts to properly address the parser's counts/totals, but these are quite challenging given their upstream Bulkrax implementation. In other words, expect the following counts to not show the correct summary totals.

  • Total Works
  • Total Collections
  • Total File Set

Further, I have not addressed the collection's thumbnails. There is custom logic for uploading an image to a collection. I have written Issue scientist-softserv/adventist_knapsack#567 to track this functionality; PR scientist-softserv/adventist-dl#131 relates to this issue.

What does reviewing look like? With the changes of this commit, you should be able to see the raw metadata and parsed metadata for each imported collection. What we then want to confirm is that both the raw and the parsed metadata are present on the imported collections.

What I Suspect will be the Raw Metadata

```
20000026 Pertandaan Zaman Periodical 1901-01-01 Malay v. illus. 4to and fol. Center for Adventist Research Singapore The Signs Press No Copyright - United States Rumah Tangga Dan Kesehatan https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.OBJ.jpg https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.TN.jpg Collection
```

```xml
Signs of the Times presents articles that are considered to be helpful in assisting readers to live in modern society. The magazine focuses on lifestyle issues, health articles and Christian devotional and other religious articles. From its historical roots, the magazine emphasizes the second coming of Christ to this earth and living such lives so as to be able to meet Jesus at His second coming. 20000027 (OCoLC)2268533.; 0037-5047; 0040-6058; 64048061 //r84; sn 78000386 Signs of the Times Periodical 1874-01-01 English Published in Oakland Jun 1874 to Aug 1904; Mountain View Sep 1904 to 1984. Published by James White Jun 1874 to Sep 1874; Feb 1875 to 15 Apr 1875. California Conference Oct 1874 to Jan 1875. Pacific SDA Publishing Association 22 Apr 1875 to 22 Mar 1880. SDA Missionary Society 18 Mar 1880 to 7 Feb 1884. International Tract and Missionary Society 14 Feb 1884 to Jun 1890. Pacific Press Publishing Company Jul 1890 to Dec 1908. Pacific Press Publishing Association Jan 1909 to 1984. v. ill. 25-28. Center for Adventist Research Oakland, California, USA; Mountain View, California, USA James White; California Conference of Seventh-day Adventists; Pacific SDA Publishing Association; The SDA Missionary Society; International Tract and Missionary Society; Pacific Press Publishing Company; Pacific Press Publishing Association No Copyright - United States Bible Prophecies; Christian Life https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000027/20000027.OBJ.jpg https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000027/20000027.TN.jpg Collection
```

```xml
20000028 0030-6894; sn 78005141 Our Little Friend Periodical 1890-01-01 English v. Center for Adventist Research Oakland, California, USA; Mountain View, California, USA; Boise, Idaho, USA Pacific Press Publishing Association No Copyright - United States Children -- Religious life; Periodicals https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000028/20000028.OBJ.jpg https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000028/20000028.TN.jpg Collection
```

KatharineV commented 1 year ago

Eric checked on our end, and he told me that the adl:periodical set contains the "collections" that the adl:issue works fall into. For this reason, I'm placing ticket scientist-softserv/adventist_knapsack#579 high on the priority list: based on my reading of his answer, I need to run a complete adl:periodical import before running adl:issue.

Here is the question I sent to Eric and the answer he gave me, for you to check my reasoning.

KATHARINE: ADL Bulkrax Import question: We ran an import on adl:periodical with a limit of 100 works. No works imported, and the importer just created a strangely-named and empty collection. I’ve asked SoftServ to look into the Hyku side of what happened here. But from our end, what do we expect from the adl:periodical set? What is and/or should be in that set? Is there a reason on our end why Bulkrax found no works? 

 

ERIC: As far as the auto collection creation thing, the idea was that Bulkrax would create a periodical master collection the first time it saw one and then load the correct records into it. However, running the “is this a unique periodical” query each time a record hit the system was far too time-consuming, so Rob turned off that check at one point. After the load was completed, he went back and wrote a script that automatically cleaned things up, grouping records by the name of their parent collections. Because each apparent duplicate collection has its own unique Hyku ID under the hood, Rob was able to automatically decide which collection would be the “keeper” and then move all the issues out of the “duplicate” collections into the one his code selected as the “keeper”. He then deleted all of the empty collections that were left. The reason you are not seeing one collection for each periodical issue is that the “duplicates” are created at the import batch level and not the item import level, if I recall correctly. As far as not importing works, I am seeing returns for both periodical-related sets.

https://oai.adventistdigitallibrary.org/OAI-script?verb=ListIdentifiers&set=adl%3Aissue

https://oai.adventistdigitallibrary.org/OAI-script?verb=ListIdentifiers&set=adl%3Aperiodical
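
For anyone who wants to check these two set listings without a browser, here is a minimal Python sketch that builds the same ListIdentifiers URL and pulls the identifiers out of a response. The helper names and the sample response are hypothetical and only for illustration. The optional `metadata_prefix` argument is an assumption on my part: the OAI-PMH spec normally expects `metadataPrefix` on a first ListIdentifiers request, although the URLs above omit it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL taken from the links above.
BASE_URL = "https://oai.adventistdigitallibrary.org/OAI-script"
# Standard OAI-PMH 2.0 XML namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(set_spec, metadata_prefix=None):
    """Build a ListIdentifiers URL for one set (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "set": set_spec}
    if metadata_prefix:
        params["metadataPrefix"] = metadata_prefix
    return f"{BASE_URL}?{urlencode(params)}"


def parse_identifiers(xml_text):
    """Return the identifier of every <header> in a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Made-up response body, NOT real output from the ADL feed.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>example-id-1</identifier></header>
    <header><identifier>example-id-2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_identifiers_url("adl:periodical"))
    print(len(parse_identifiers(SAMPLE_RESPONSE)), "identifiers in the sample")
```

A non-empty identifier list for a set suggests the feed is serving records, which would point back at the importer side rather than the OAI side.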

The idea was (not sure what SS wants now) to import the periodical set first and then import the issues. This was so that the periodical collection that holds the issues would get our metadata and not be one of those “auto created” collections.
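
The cleanup pass Eric describes (group duplicate collections by name, pick a “keeper”, move the issues over, delete the empties) could look roughly like the Python sketch below. This is not Rob's script, which is not shown in this thread. The `Collection` class, the function name, and especially the rule for choosing the keeper (lowest ID) are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical stand-in for a Hyku collection."""
    id: str
    title: str
    member_ids: list = field(default_factory=list)


def merge_duplicate_collections(collections):
    """Group collections by title, keep one per title, and fold the rest into it.

    Returns (keepers, deleted_ids). The keeper is the collection with the
    lowest id; that rule is an assumption, not the rule the real script used.
    """
    by_title = defaultdict(list)
    for collection in collections:
        by_title[collection.title].append(collection)

    keepers, deleted_ids = [], []
    for group in by_title.values():
        keeper = min(group, key=lambda c: c.id)
        for duplicate in group:
            if duplicate is keeper:
                continue
            # Move the issues into the keeper, then mark the empty
            # duplicate for deletion.
            keeper.member_ids.extend(duplicate.member_ids)
            duplicate.member_ids = []
            deleted_ids.append(duplicate.id)
        keepers.append(keeper)
    return keepers, deleted_ids
```

Importing adl:periodical before adl:issue, as described above, would mean the parent collections already exist with their own metadata by the time the issues arrive.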

DeonFranklin commented 1 year ago

This ticket passes SoftServ QA.

KatharineV commented 1 year ago

I followed the testing instructions and created a new adl:periodical set importer on staging with a limit of 3 collections. The importer is done, but I don't see links to collections that I can test. The importer page only links one collection, and it is not a real collection (i.e. it isn't pulling from our OAI feed and it doesn't reflect any of the collection info I expect to see; it appears to be created by the importer and it is named for the set spec). Importer is here: https://adl.s2.adventistdigitallibrary.org/importers/32?locale=en

jeremyf commented 1 year ago

@KatharineV on the importer page, click "Collection Entries".

Then click on a collection.

Finally, click the "Collection Link: Collection" link.

KatharineV commented 1 year ago

All looks good on staging. I tested the collections in this import. One is stuck "pending," but @jeremyf knows about that collection. It got stuck before.

KatharineV commented 1 year ago

I ran a test import of 3 periodicals on production, and it completed. The raw metadata looks right with these exceptions:

  1. Parent Collection (part_Of) for Pertandaan Zaman is still incorrectly showing The Southern Watchman, which doesn't match the part_Of raw metadata.

  2. Fields with multiple values are not splitting at the semicolon, so multiple values are displaying as a single metadata field. Example: Signs of the Times has multiple publishers and two subjects, but they failed to split.

These issues have been noted in other tickets, so I'm just restating them here to clarify that this test of 3 periodicals appears to work in general while continuing to have the specific problems that are being worked on overall. Thanks!
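
To make the expected behavior in item 2 concrete, here is a minimal Python sketch of splitting a multi-valued field at the semicolon. It is not the importer's actual code. The function name is hypothetical, and the sample values are copied from the Signs of the Times raw metadata earlier in this thread.

```python
def split_multi_value(raw, delimiter=";"):
    """Split a delimited field into trimmed values, dropping empty pieces.

    Hypothetical helper: illustrates the expected result only.
    """
    if raw is None:
        return []
    return [part.strip() for part in raw.split(delimiter) if part.strip()]


subjects = split_multi_value("Bible Prophecies; Christian Life")
publishers = split_multi_value(
    "James White; California Conference of Seventh-day Adventists; "
    "Pacific SDA Publishing Association; The SDA Missionary Society; "
    "International Tract and Missionary Society; "
    "Pacific Press Publishing Company; Pacific Press Publishing Association"
)

if __name__ == "__main__":
    print(subjects)
    print(len(publishers), "publishers")
```

With splitting in place, the subject field would display as two separate values and the publisher field as seven.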

KatharineV commented 1 year ago

Clarification and update on my comment above:

I ran a test import of 3 periodicals on staging, and Pertandaan Zaman shows the correct textual metadata in the Part of field, but it is physically a part of the wrong Parent Collection.

jeremyf commented 1 year ago

@KatharineV I fixed the underlying issue with assigning a collection to another collection. One challenge remains: as I read the logic, each import adds to the existing relationships (e.g. the collection listing) rather than replacing them.

I manually removed the relationships and then re-ran the importer. I believe the metadata for Pertandaan Zaman now shows the correct Part of value as well as the correct collection relationship.

KatharineV commented 1 year ago

I created a periodical import on ADL staging to run through the testing instructions for this ticket, and it is stuck pending. No entries are showing up. The import was set to bring in the first 3 periodicals again, to confirm that they land in their proper collection relationships. If the importer doesn't move out of pending, I will not be able to test this ticket. I'll update here if things change, but for now I'm assuming something is wrong and I won't get to finish testing...

Edited to add that I also set up an importer on the SDAPI tenant staging environment, just to see if the tenant was the issue. The importers are stuck in both places.

KatharineV commented 1 year ago

Tried to test this ticket today (2/28/23) and the importer is stuck pending.

jeremyf commented 1 year ago

I went ahead and edited the importer via the UI. I clicked "Update and Re-Harvest All Items"; that appeared to unstick the importer.

KatharineV commented 1 year ago

Periodicals are importing to Staging as expected. A test import today (2023-03-13) completed without any problems.

KatharineV commented 1 year ago

Working as expected on production.