sul-dlss / purl

URL resolver that translates a reference to a digital object in the form of a druid, into a full content representation of that object as available

Analysis to investigate requirement for implementing OAI-PMH #795

Closed amyehodge closed 1 year ago

amyehodge commented 1 year ago

In particular, we want to test our implementation to make sure content is getting picked up by Unpaywall, which uses OAI-PMH. Rochelle mentioned that "Unpaywall feeds OA publications to Web of Science and Scopus, among others, so having a connection here would make a big impact."

We also want our OAI-PMH implementation to meet the requirements for CORE (https://core.ac.uk). These requirements can be found at https://docs.google.com/document/d/1sc8RSAhJT4kmYxUKgKvNbPSegejosr83VIqOo6CR7mY/edit.

This is some information I collected while researching requirements for Unpaywall that may be useful. The idea was to get them to pick up the OA versions of published articles that we have in SDR. Note that we may not actually have any items that have the necessary metadata, but we may be able to create/enhance a few sample items to have something for testing purposes.

lwrubel commented 1 year ago

This is a research / analysis ticket to determine what would be involved in setting up an OAI-PMH endpoint that would support Unpaywall's harvesting requirements above.

justinlittman commented 1 year ago

I'm confused. Does Unpaywall need the DOI / identifier of the published article? If so, how/where would that be recorded in SDR?

amyehodge commented 1 year ago

@justinlittman I know that currently we aren't collecting that info in H2. I'm wondering if @arcadiafalcone or @andrewjbtw know of any items that might currently have this metadata in the right format.

I have some preliminary designs for implementing the collection of this information via H2 that I worked on with Rochelle, but that work did not make it into the last work cycle. But if we can get OAI-PMH set up, then when we start collecting the metadata there it can immediately be harvested.

justinlittman commented 1 year ago

Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

justinlittman commented 1 year ago

Also, if I'm not mistaken, implementing OAI-PMH would require some sort of datastore. Since PURL doesn't have a datastore by design, this would need to be a completely separate application and would require some mechanism for keeping in sync with publishing.
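For context on why a datastore comes up: OAI-PMH harvesters can request records selectively by datestamp range and by set, so a provider has to know when each record last changed and which sets it belongs to. Below is a minimal sketch of the kind of `ListRecords` request a harvester might send. The base URL and the set name `unpaywall` are hypothetical placeholders, not an existing endpoint; only the `verb`, `metadataPrefix`, `from`, `until`, and `set` arguments come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://oai.example.edu/oai"


def list_records_url(metadata_prefix="oai_dc", from_date=None,
                     until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL with optional selective-harvest arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # only records changed on or after this date
    if until_date:
        params["until"] = until_date  # only records changed on or before this date
    if set_spec:
        params["set"] = set_spec      # only records belonging to this set
    return f"{BASE_URL}?{urlencode(params)}"


# A monthly incremental harvest of a hypothetical "unpaywall" set.
url = list_records_url(from_date="2024-01-01", set_spec="unpaywall")
print(url)
```

Answering a request like this means filtering on a per-record timestamp and set membership, which is the state that PURL itself does not keep.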

justinlittman commented 1 year ago

https://github.com/code4lib/ruby-oai

amyehodge commented 1 year ago

> Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

There are two examples in the Driver guidelines at https://wiki.surfnet.nl/display/DRIVERguidelines/DC+-+RELATION+and+Linking+related+objects of implementations of the relationship metadata.

There is also a vocabulary for expressing the types of article versions: https://wiki.surfnet.nl/display/DRIVERguidelines/Version+vocabulary

Here's one of the examples for a paper that has been submitted for peer review:

    <oai_dc:dc>
      <dc:identifier>http://hdl.handle.net/1234/1111</dc:identifier>
      <dc:type>info:eu-repo/semantics/paper</dc:type>
      <dc:type>info:eu-repo/semantics/submittedVersion</dc:type>
      <dc:relation>http://hdl.handle.net/1234/2222</dc:relation>
    </oai_dc:dc>
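As a rough illustration of what generating such a record could look like, here is a sketch in Python using only the standard library. The function name `build_oai_dc` and its parameters are made up for this example; the namespace URIs are the standard `oai_dc` and Dublin Core ones, and the values are the ones from the DRIVER example above. It is not how PURL produces its DC today.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Make the serializer use the conventional prefixes instead of ns0, ns1.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_oai_dc(identifier, types, relations):
    """Build an oai_dc:dc element with identifier, type, and relation children."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(name, value):
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value

    add("identifier", identifier)
    for type_value in types:
        add("type", type_value)
    for relation in relations:
        add("relation", relation)
    return root


record = build_oai_dc(
    identifier="http://hdl.handle.net/1234/1111",
    types=[
        "info:eu-repo/semantics/paper",
        "info:eu-repo/semantics/submittedVersion",
    ],
    relations=["http://hdl.handle.net/1234/2222"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The open question in this thread is where the `dc:relation` value (the published version's identifier) and the version type would come from in Cocina or MODS; the sketch simply takes them as inputs.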

justinlittman commented 1 year ago

I meant in cocina and/or mods.

amyehodge commented 1 year ago

@arcadiafalcone do you know the answer to @justinlittman 's question above?

andrewjbtw commented 1 year ago

I'm not aware of metadata where the DOI of the published version/version of record is specifically identified in Cocina. I'm sure people have included it in related item links at times but probably not in a way that sets it apart from other related items.

andrewjbtw commented 1 year ago

Here's an example where the article links to the published version in the "related items": https://purl.stanford.edu/bw723vz5327 (also links to it in the abstract)

amyehodge commented 1 year ago

If anyone is interested, I have two related requirements docs for H2 around this point at https://docs.google.com/document/d/1kk-jHgkovZ6ghKxPvmcEcwHthDYwJrnLyAH5ods8YWE/edit and https://docs.google.com/document/d/1Ci5BFpTfhw5QyDWkOfACfyvznm3YkHYOpq3XXsFqlYo/edit. I think the first one is what we would need here for H2 content. But it didn't make it into the last H2 work cycle and may be too complex to include here. It never got to the point of discussion with developers, and maybe not even with Arcadia, so likely needs work still to be actionable.

arcadiafalcone commented 1 year ago

It would be possible to represent the above example in MODS, if the metadata is collected from the user or derived from the linked resource.

lwrubel commented 1 year ago

Current status is waiting on information about other aggregators or use cases OAI-PMH would support (now or in the future). @amyehodge is finding out about the frequency of content needing to be updated. Design will be dependent on what the requirements are. Preliminary design discussion included @justinlittman.

edsu commented 1 year ago

Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.

https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

It's kind of a long shot, but it would be nice!

amyehodge commented 1 year ago

Notes from OAI-PMH implementation discussion https://docs.google.com/document/d/1LYB_0ynJoHLsEobxJgdhU7q4jWoO0wDi4XF4omOEYTk/edit

amyehodge commented 1 year ago

I have confirmed that a monthly frequency for updating the content would be acceptable. She has also expressed interest in supporting core.ac.uk, and I have received an introduction to those folks and am waiting to hear who my contact there will be to get the technical information we require, since I can't find it on the web.

lwrubel commented 1 year ago

We need to figure out how to identify the items that are needed for each service before implementation.

justinlittman commented 1 year ago

Assuming that:

  1. The purl filesystem is available on the OAI-PMH server for indexing.
  2. Either the existing DC metadata could be used or a mapping could be created to the metadata formats required by CORE and Unpaywall from MODS or Cocina.
  3. The items to be included in the CORE and Unpaywall sets can be identified in the Cocina or public XML.

One can imagine an implementation using ruby-oai that indexes as follows:

  1. The purl filesystem is crawled. For each crawled item, if a record already exists, it is updated. If no record exists and the item is in any set, a record is created. If no record exists and the item is not in any set, no record is created.
  2. When a record is created or updated, its membership (included, not included, deleted) in each set is recorded, and any necessary metadata is pre-generated and stored.
  3. For each existing record that has not been recently crawled (as determined by its updated timestamp) and is a member of any set, the purl filesystem is checked for the item. If the item exists, the record is updated as described above. If the item does not exist, its set memberships are set to deleted.
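The indexing steps above can be sketched roughly as follows. This is an in-memory illustration in Python, not ruby-oai code: the `Record` class, the set names, the `sets_for` rule, and the dict standing in for the purl filesystem are all hypothetical. It also assumes that an item which leaves a set it was previously in is marked deleted for that set, which the steps above do not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

SETS = ("core", "unpaywall")  # hypothetical set names


@dataclass
class Record:
    druid: str
    memberships: dict  # set name -> "included" | "not_included" | "deleted"
    metadata: str      # pre-generated metadata for the record
    updated_at: datetime


def sets_for(item):
    """Hypothetical membership rule: the item lists its own sets."""
    return set(item.get("sets", []))


def upsert(store, druid, item, now):
    """Steps 1 and 2: create or update a record and its set memberships."""
    wanted = sets_for(item)
    existing = store.get(druid)
    if existing is None and not wanted:
        return  # not in any set and no record yet: nothing to do
    memberships = {}
    for name in SETS:
        if name in wanted:
            memberships[name] = "included"
        elif existing and existing.memberships.get(name) in ("included", "deleted"):
            memberships[name] = "deleted"  # assumption: leaving a set counts as deletion
        else:
            memberships[name] = "not_included"
    store[druid] = Record(druid, memberships, item.get("metadata", ""), now)


def index(store, filesystem, now):
    """Run one indexing pass over a dict standing in for the purl filesystem."""
    for druid, item in filesystem.items():  # step 1: crawl
        upsert(store, druid, item, now)
    for druid, record in list(store.items()):  # step 3: records not touched by this crawl
        if record.updated_at >= now or "included" not in record.memberships.values():
            continue
        if druid in filesystem:
            upsert(store, druid, filesystem[druid], now)
        else:
            record.memberships = {
                name: ("deleted" if state == "included" else state)
                for name, state in record.memberships.items()
            }
            record.updated_at = now


store = {}
first_pass = datetime(2024, 1, 1, tzinfo=timezone.utc)
index(store, {"bb111bb1111": {"sets": ["unpaywall"], "metadata": "<dc/>"}}, first_pass)
print(store["bb111bb1111"].memberships)
```

Bumping `updated_at` when memberships flip to deleted is what would let a harvester doing an incremental, datestamp-based harvest notice the deletion.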

The risk in this implementation is ruby-oai's unclear support for sets:

> There is some code written to support oai-pmh "sets" in the ActiveRecord::Wrapper, but it's somewhat inflexible, and not well-documented, and as I write this I don't understand it enough to say more. See https://github.com/code4lib/ruby-oai/issues/67

See https://github.com/code4lib/ruby-oai/blob/master/lib/oai/provider.rb#L266C30-L266C76

justinlittman commented 1 year ago

@amyehodge I'm closing this as the technical analysis is complete. I'd suggest that metadata analysis is still required. If you disagree on closing, feel free to re-open.

amyehodge commented 1 year ago

That's fine @justinlittman . I haven't had a chance to look at this in detail yet, but I'll try to sort out next steps to move this along. Thanks.

amyehodge commented 10 months ago

> Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.
>
> https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

I just checked on this note and Unpaywall did respond to say that they don't support collecting data from sitemaps. Thanks for checking @edsu