scientist-softserv / adventist_knapsack

Apache License 2.0

Spike: Adventist Collection Situation #459

Open ShanaLMoore opened 1 year ago

ShanaLMoore commented 1 year ago

The goal of this ticket is to provide a pathway forward with the Collection behavior for Adventist.

Related Tickets

Spike Results

In Adventist there are two parsers: CSV and OAI. For OAI parsing we have departed from Bulkrax’s default approach. By default, Bulkrax assumes that all collections, as identified by the record header setSpec of work-type objects, will be created before the importers run.

However, in Adventist, we have a set (e.g. “adl:periodical”) that describes collections in the child elements of the record metadata oai_adl element.

Examples and Commentary

First, let’s look at records from two of the OAI sets to see their structure.

An “adl:periodical” entry:

```xml
20000026 2022-12-15T05:09:20Z adl:periodical
20000026 Pertandaan Zaman Periodical 1901-01-01 Malay v. illus. 4to and fol. Center for Adventist Research Singapore The Signs Press No Copyright - United States Rumah Tangga Dan Kesehatan https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.OBJ.jpg https://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.TN.jpg Collection
```

In the default Bulkrax::OaiParser, we’d import the above entry as a work (of type “Collection”), because the default assumption is that records are works. Note that this would only partially work.

To account for this difference, we created Bulkrax::OaiAdventistQdcParser, which extends Bulkrax::OaiQualifiedDcParser and heavily modifies it to detect records with <work_type>Collection</work_type> and import each one as a collection entry.

One notable difference is that because we’re importing the record as a collection, we don’t have the constraint around collections that we later have for works.

In other words, for the above “adl:periodical” entry, we ignore all record header setSpec elements.
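The work_type check can be sketched in a few lines of Ruby. This is only an illustration, not the code in Bulkrax::OaiAdventistQdcParser: the helper name `entry_kind_for` is hypothetical, and it assumes the record metadata has already been parsed into a Hash keyed by element name.

```ruby
# Hypothetical helper, not part of Bulkrax: decide how a parsed record
# should be imported. `metadata` is assumed to be a Hash of
# element name => value taken from the record's metadata.
def entry_kind_for(metadata)
  work_type = metadata.fetch("work_type", "").to_s.strip
  work_type == "Collection" ? :collection : :work
end

# Trimmed-down versions of the two example records above.
periodical = { "aark_id" => "20000026", "work_type" => "Collection" }
issue      = { "aark_id" => "20088752", "work_type" => "PublishedWork" }

puts entry_kind_for(periodical) # prints "collection"
puts entry_kind_for(issue)      # prints "work"
```

A record with no work_type falls through to `:work`, which matches the default assumption that records are works.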

An “adl:issue” entry:

```xml
20088752 2022-12-15T05:09:20Z 20000062 adl:issue
20088752 Atlantic-Union-Gleaner_0001000119020101 Atlantic Union Gleaner | January 1, 1902 Issue 1 1 English Office of Archives, Statistics, and Research 42.45022,-71.682072 South Lancaster, Massachusetts, USA Atlantic Union Conference of Seventh-day Adventists https://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.ARCHIVAL.pdf;https://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.X1.RAW.txt https://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.TN.jpg PublishedWork
```

In the “adl:issue” entry we have two record header setSpec elements: “adl:issue” and “20000062”. The “20000062” value should correspond to the record metadata aark_id of a record found in the “adl:periodical” OAI set; in other words, the collection that this issue is part of. The “adl:issue” value is something to discard.

We assume that the “adl:periodical” import has completed successfully before proceeding with importing the “adl:issue”. That way all collections for the issues are created before running the import of issues.
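As a rough sketch of that ordering assumption, the setSpec values of an issue can be split into set names and parent identifiers, and the parents checked against what the “adl:periodical” import produced. Everything here is hypothetical: the helper `parent_aark_ids`, the rule that set names start with “adl:”, and the `created_collections` lookup are illustrative stand-ins, not Bulkrax or knapsack code.

```ruby
# Hypothetical helper: keep only the setSpec values that point at a parent
# collection. Assumes set names carry the "adl:" prefix and that anything
# else is the aark_id of a periodical.
def parent_aark_ids(set_specs)
  set_specs.reject { |spec| spec.start_with?("adl:") }
end

# Stand-in for the collections created by the earlier "adl:periodical"
# import, keyed by aark_id. The value is a placeholder, not a real id.
created_collections = { "20000062" => "placeholder-collection-id" }

set_specs = ["20000062", "adl:issue"] # from the "adl:issue" example above
parents   = parent_aark_ids(set_specs)
missing   = parents.reject { |id| created_collections.key?(id) }

puts parents.inspect # prints ["20000062"]
puts missing.empty?  # prints true
```

If `missing` were not empty, the periodical import would need to finish (or be re-run) before the issue could be attached to its collection.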

Logical Conflicts

In default Bulkrax, we have a method collections_created?. This is used as a guard clause for importers (via the Bulkrax::ImportBehavior#build_for_importer method).

For the default Bulkrax::OaiEntry, that check passes when either the importer is using the magical set named “all” or there is exactly one element in #collection_ids, which are found by the Bulkrax::OaiEntry#find_collection_ids method.
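The guard as described above can be sketched as a standalone predicate. This is a paraphrase of the behavior described in this ticket, not Bulkrax’s source: the free-function form, the parameter names, and the stand-in error class `CollectionsNotCreated` are all assumptions made so the example runs on its own.

```ruby
# Stand-in for the guard described above (not Bulkrax's actual method).
# `set_name` is the OAI set the importer targets; `collection_ids` is the
# list of collection ids found for the entry.
def collections_created?(set_name, collection_ids)
  return true if set_name == "all"
  collection_ids.length == 1
end

# Hypothetical error class standing in for the one the guard raises.
class CollectionsNotCreated < RuntimeError; end

# Hypothetical build step that refuses to run until the guard passes.
def build_entry!(set_name, collection_ids)
  unless collections_created?(set_name, collection_ids)
    raise CollectionsNotCreated, "collections not yet created for #{set_name}"
  end
  :built
end

puts collections_created?("all", [])                      # prints true
puts collections_created?("adl:issue", ["one-id"])        # prints true
puts collections_created?("adl:issue", ["id-a", "id-b"])  # prints false
```

Note that the count-based branch says nothing about whether the referenced collection exists; it only checks how many ids were found.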

In “ADL:Issues import stuck in pending” (scientist-softserv/adventist_knapsack#460), Katharine has imported the “adl:periodical” set and is now attempting to import the “adl:issue” set. The issue entries are stuck in pending. Looking further at the exceptions, we’re encountering a RuntimeError that is not bubbling up to the importer.

That RuntimeError is the Bulkrax::CollectionsCreatedError. In other words, for this record, the guard concludes that we don’t have a created collection, even though the parsed metadata includes member_of_collections_attributes: {"0"=>{"id"=>"ffd7d24c-2d65-4390-8394-6c33e01e2cbe"}}, and that collection appears to exist (see Atlantic Union Gleaner // Adventist Digital Library).

Looking at the CSV Situation

The Bulkrax::CsvEntry#collections_created? always returns true. So we are not likely to encounter the same problem.

Proposal

Reviewing this, we clearly have a case where logic is incorrect or odd. My suspicion is that the Bulkrax::OaiEntry#collections_created? logic is leaky. We’re not working with the “all” collection, so instead we say collections were created if we have one and only one associated “collection_id”. But, based on our example “adl:issue” record, we have two record header setSpec elements.

ShanaLMoore commented 1 year ago

TODO per Jeremy:

Test with a live record of a collection it thinks should already exist.