In Adventist there are two parsers: CSV and OAI. For OAI parsing we have adjusted our approach from default Bulkrax. By default, Bulkrax assumes that all collections, as identified by the `record header setSpec` elements of work-type objects, will be created before running the importers.
However, in Adventist, we have a set (e.g. “adl:periodical”) that describes collections in the child elements of the `record metadata oai_adl` element.
Examples and Commentary
First let’s look at records from two of the OAI sets to see their structure.
An “adl:periodical” entry:
```xml
200000262022-12-15T05:09:20Zadl:periodical20000026Pertandaan ZamanPeriodical1901-01-01Malayv. illus. 4to and fol.SingaporeThe Signs PressNo Copyright - United StatesRumah Tangga Dan Kesehatanhttps://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.OBJ.jpghttps://adl-ebstore-repo.s3.amazonaws.com/20/0000/20000026/20000026.TN.jpgCollection
```
With the default `Bulkrax::OaiParser` we’d import the above entry as a work (of type “Collection”), because the default assumption is that records are works. Note that this would only partially work.
To account for this difference, we created `Bulkrax::OaiAdventistQdcParser`, which extends `Bulkrax::OaiQualifiedDcParser` and heavily modifies it to sniff out records that have `<work_type>Collection</work_type>` and import them as collection entries.
One notable difference is that because we’re importing the record as a collection, we don’t have the constraint around collections that we later have for works.
In other words, for the above “adl:periodical” entry, we ignore all record header setSpec elements.
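The sniffing described above can be sketched in plain Ruby. This is a minimal illustration, not the `Bulkrax::OaiAdventistQdcParser` code: `ParsedRecord`, `entry_kind`, and `membership_set_specs` are hypothetical names, and the field values are our reading of the two example records in this ticket.

```ruby
# Hypothetical, simplified stand-in for one parsed OAI record.
# It is NOT a Bulkrax class; it only carries the fields discussed here.
ParsedRecord = Struct.new(:set_specs, :aark_id, :work_type, keyword_init: true)

# Records whose work_type is "Collection" become collection entries;
# everything else is treated as a work.
def entry_kind(record)
  record.work_type == "Collection" ? :collection : :work
end

# Collection entries ignore their header setSpec values entirely.
# Work entries keep them, because they point at parent collections.
def membership_set_specs(record)
  entry_kind(record) == :collection ? [] : record.set_specs
end

# Values as read from the example records (element boundaries are an assumption).
periodical = ParsedRecord.new(
  set_specs: ["adl:periodical"], aark_id: "20000026", work_type: "Collection"
)
issue = ParsedRecord.new(
  set_specs: ["20000062", "adl:issue"], aark_id: "20088752", work_type: "PublishedWork"
)

puts entry_kind(periodical)
puts membership_set_specs(periodical).inspect
puts membership_set_specs(issue).inspect
```

The point of the sketch is only that the collection-or-work decision happens before any setSpec handling, which is why the “adl:periodical” setSpec never has to resolve to a collection.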
An “adl:issue” entry:
```xml
200887522022-12-15T05:09:20Z20000062adl:issue20088752Atlantic-Union-Gleaner_0001000119020101Atlantic Union Gleaner | January 1, 1902Issue11English42.45022,-71.682072South Lancaster, Massachusetts, USAAtlantic Union Conference of Seventh-day Adventistshttps://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.ARCHIVAL.pdf;https://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.X1.RAW.txthttps://adl-ebstore-repo.s3.amazonaws.com/20/0887/20088752/20088752.TN.jpgPublishedWork
```
In the “adl:issue” record we have two `record header setSpec` elements: “adl:issue” and “20000062”. The “20000062” value should correspond to the `record metadata aark_id` of a record found in the “adl:periodical” OAI set; in other words, the collection that this issue is part of. The “adl:issue” value appears to be something to discard.
We assume that the “adl:periodical” import has completed successfully before proceeding with importing the “adl:issue”. That way all collections for the issues are created before running the import of issues.
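To make the ordering assumption concrete, here is a small sketch of how an issue’s setSpec values might be matched against collections that already exist. The lookup hash, the `partition_set_specs` helper, and the placeholder collection id are all hypothetical; real code would query the repository instead.

```ruby
# Hypothetical lookup of collections created by the "adl:periodical" import,
# keyed by the collection record's aark_id. The id value is a placeholder.
collections_by_aark_id = { "20000062" => "placeholder-collection-id" }

# Split a work's setSpec values into those that resolve to an existing
# collection and those that do not (e.g. the set name "adl:issue").
def partition_set_specs(set_specs, collections_by_aark_id)
  matched, unmatched = set_specs.partition { |spec| collections_by_aark_id.key?(spec) }
  {
    collection_ids: matched.map { |spec| collections_by_aark_id.fetch(spec) },
    discarded: unmatched
  }
end

result = partition_set_specs(["20000062", "adl:issue"], collections_by_aark_id)
puts result.inspect

# If the periodical import has not run yet, nothing resolves.
too_early = partition_set_specs(["20000062", "adl:issue"], {})
puts too_early.inspect
```

The second call shows why the “adl:periodical” import has to finish first: with an empty lookup, the issue has no collection to attach to.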
The goal of this ticket is to provide a pathway forward with the Collection behavior for Adventist.
Related Tickets
Spike Results
Logical Conflicts
In default Bulkrax, we have a method `collections_created?`. It is used as a guard clause for importers (via the `Bulkrax::ImportBehavior#build_for_importer` method).
For the default `Bulkrax::OaiEntry`, that check depends on whether we are importing the magical set named “all” and/or whether there is exactly one element in `#collection_ids`, which is populated by the `Bulkrax::OaiEntry#find_collection_ids` method.
In “ADL:Issues import stuck in pending · Issue scientist-softserv/adventist_knapsack#460 · scientist-softserv/adventist-dl”, we have a state where Katharine has imported the “adl:periodical” set and is now attempting to import the “adl:issue” set. These imports are stuck. Looking further at the exceptions, we’re encountering a `RuntimeError` that is not bubbling up to the importer.
That `RuntimeError` is a `Bulkrax::CollectionsCreatedError`. In other words, for this record, we don’t have a created collection, even though we have parsed metadata of `member_of_collections_attributes: {"0"=>{"id"=>"ffd7d24c-2d65-4390-8394-6c33e01e2cbe"}}`… which appears to exist (see Atlantic Union Gleaner // Adventist Digital Library).
Looking at the CSV Situation
`Bulkrax::CsvEntry#collections_created?` always returns true, so we are not likely to encounter the same problem there.
Proposal
Reviewing this, we clearly have a case where the logic is incorrect or odd. My suspicion is that the `Bulkrax::OaiEntry#collections_created?` logic is leaky. We’re not working with the “all” collection, so instead we say collections were created if we have one and only one associated “collection_id”. But, based on our example “adl:issue” record, we have two `record header setSpec` elements.
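The suspected mismatch can be illustrated with a short Ruby sketch. It is not the Bulkrax implementation: both method names are hypothetical, and it assumes, purely for illustration, that each setSpec contributes one candidate to the list being counted. Whether Bulkrax actually behaves that way is something the spike still needs to confirm.

```ruby
# The rule as described in this ticket for non-"all" imports:
# collections count as created only when there is exactly one id.
def strict_collections_created?(collection_ids)
  collection_ids.size == 1
end

# One possible direction (an assumption, not a decided fix): ignore the
# setSpec that merely names the harvested set, then require every
# remaining setSpec to match an already-created collection.
def collections_created_ignoring_set?(set_specs, created_collection_keys, harvested_set:)
  expected = set_specs - [harvested_set]
  expected.all? { |spec| created_collection_keys.include?(spec) }
end

issue_set_specs = ["20000062", "adl:issue"]

# Two candidates, so the strict rule fails even though the collection exists.
puts strict_collections_created?(issue_set_specs)

# Ignoring "adl:issue", the one remaining setSpec matches a created collection.
puts collections_created_ignoring_set?(issue_set_specs, ["20000062"], harvested_set: "adl:issue")

# With no collections created yet, the guard still blocks the import.
puts collections_created_ignoring_set?(issue_set_specs, [], harvested_set: "adl:issue")
```

The second variant keeps the guard’s intent (don’t build works before their collections exist) while tolerating records that carry both a set name and a collection identifier in their header.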