Closed amyehodge closed 1 year ago
This is a research / analysis ticket to determine what would be involved in setting up an OAI-PMH endpoint that would support Unpaywall's harvesting requirements above.
I'm confused. Does Unpaywall need the DOI / identifier of the published article? If so, how/where would that be recorded in SDR?
@justinlittman I know that currently we aren't collecting that info in H2. I'm wondering if @arcadiafalcone or @andrewjbtw know of any items that might currently have this metadata in the right format.
I have some preliminary designs for implementing the collection of this information via H2 that I worked on with Rochelle, but that work did not make it into the last work cycle. But if we can get OAI-PMH set up, then when we start collecting the metadata there it can immediately be harvested.
Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.
Also, if I'm not mistaken, implementing OAI-PMH would require some sort of datastore. Since PURL doesn't have a datastore by design, this would need to be a completely separate application and would require some mechanism for keeping in sync with publishing.
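To make the datastore concern concrete, here is a minimal Python sketch, assuming a hypothetical separate application that is notified whenever an item is published or withdrawn. The table layout and the function names (`open_store`, `record_published`, `record_withdrawn`, `changed_since`) are illustrative assumptions, not part of PURL or any existing SDR service. It shows why some store seems necessary: OAI-PMH selective harvesting filters on a per-record datestamp, so something has to remember when each record last changed.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) harvest store and create its table if needed."""
    db = sqlite3.connect(path)
    # identifier: e.g. a PURL; datestamp: ISO 8601 UTC string, which sorts
    # lexically; metadata: serialized oai_dc XML; deleted: tombstone flag.
    db.execute(
        """CREATE TABLE IF NOT EXISTS records (
               identifier TEXT PRIMARY KEY,
               datestamp  TEXT NOT NULL,
               metadata   TEXT NOT NULL,
               deleted    INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return db


def record_published(db, identifier, datestamp, metadata):
    """Insert or overwrite a record when the publishing side reports a change."""
    db.execute(
        "INSERT OR REPLACE INTO records (identifier, datestamp, metadata, deleted) "
        "VALUES (?, ?, ?, 0)",
        (identifier, datestamp, metadata),
    )
    db.commit()


def record_withdrawn(db, identifier, datestamp):
    """Keep a tombstone so harvesters can learn that a record went away."""
    db.execute(
        "UPDATE records SET deleted = 1, datestamp = ? WHERE identifier = ?",
        (datestamp, identifier),
    )
    db.commit()


def changed_since(db, from_datestamp):
    """Return (identifier, datestamp, deleted) for records changed on or after a date."""
    rows = db.execute(
        "SELECT identifier, datestamp, deleted FROM records "
        "WHERE datestamp >= ? ORDER BY datestamp",
        (from_datestamp,),
    )
    return [(r[0], r[1], bool(r[2])) for r in rows]
```

The open question from the comment above remains how `record_published` gets called, that is, what mechanism keeps this store in sync with publishing.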
There are two examples in the Driver guidelines at https://wiki.surfnet.nl/display/DRIVERguidelines/DC+-+RELATION+and+Linking+related+objects of implementations of the relationship metadata.
There is also a vocabulary for expressing the types of article versions: https://wiki.surfnet.nl/display/DRIVERguidelines/Version+vocabulary
Here's one of the examples for a paper that has been submitted for peer review:
<oai_dc:dc>
<dc:identifier>http://hdl.handle.net/1234/1111</dc:identifier>
<dc:type>info:eu-repo/semantics/paper</dc:type>
<dc:type>info:eu-repo/semantics/submittedVersion</dc:type>
<dc:relation>http://hdl.handle.net/1234/2222</dc:relation>
</oai_dc:dc>
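As a rough illustration of how a record like the one above could be generated, here is a minimal Python sketch using only the standard library. The function name `build_oai_dc` and its parameters are assumptions made for this example. The two namespace URIs are the standard ones for `oai_dc` and Dublin Core, and the `xsi:schemaLocation` attribute a full record would carry is omitted for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_oai_dc(identifier, types, relation=None):
    """Serialize a small oai_dc record (hypothetical helper).

    identifier: URI of the deposited copy
    types:      dc:type values, e.g. the info:eu-repo/semantics/... terms
    relation:   optional URI of the related object
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}identifier").text = identifier
    for type_value in types:
        ET.SubElement(root, f"{{{DC_NS}}}type").text = type_value
    if relation:
        ET.SubElement(root, f"{{{DC_NS}}}relation").text = relation
    return ET.tostring(root, encoding="unicode")


xml_text = build_oai_dc(
    "http://hdl.handle.net/1234/1111",
    ["info:eu-repo/semantics/paper", "info:eu-repo/semantics/submittedVersion"],
    relation="http://hdl.handle.net/1234/2222",
)
```

Where the `relation` value would come from in SDR is exactly the open metadata question in this thread; the sketch only shows the serialization step.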
I meant in cocina and/or mods.
@arcadiafalcone do you know the answer to @justinlittman 's question above?
I'm not aware of metadata where the DOI of the published version/version of record is specifically identified in Cocina. I'm sure people have included it in related item links at times but probably not in a way that sets it apart from other related items.
Here's an example where the article links to the published version in the "related items": https://purl.stanford.edu/bw723vz5327 (also links to it in the abstract)
If anyone is interested, I have two related requirements docs for H2 around this point at https://docs.google.com/document/d/1kk-jHgkovZ6ghKxPvmcEcwHthDYwJrnLyAH5ods8YWE/edit and https://docs.google.com/document/d/1Ci5BFpTfhw5QyDWkOfACfyvznm3YkHYOpq3XXsFqlYo/edit. I think the first one is what we would need here for H2 content. But it didn't make it into the last H2 work cycle and may be too complex to include here. It never got to the point of discussion with developers, and maybe not even with Arcadia, so likely needs work still to be actionable.
It would be possible to represent the above example in MODS, if the metadata is collected from the user or derived from the linked resource.
Current status is waiting on information about other aggregators or use cases OAI-PMH would support (now or in the future). @amyehodge is finding out about the frequency of content needing to be updated. Design will be dependent on what the requirements are. Preliminary design discussion included @justinlittman.
Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.
https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ
It's kind of a long shot, but it would be nice!
Notes from OAI-PMH implementation discussion https://docs.google.com/document/d/1LYB_0ynJoHLsEobxJgdhU7q4jWoO0wDi4XF4omOEYTk/edit
I have confirmed that a monthly frequency for updating the content would be acceptable. She has also expressed interest in supporting core.ac.uk, and I have received an introduction to those folks and am waiting to hear who my contact there will be to get the technical information we require, since I can't find it on the web.
We need to figure out how to identify the items that are needed for each service before implementation.
Assuming that:
One can imagine an implementation using ruby-oai that:
To index:
The risk in this implementation is ruby-oai's unclear support for sets:
There is some code written to support OAI-PMH "sets" in the ActiveRecord::Wrapper, but it's somewhat inflexible and not well documented, and as I write this I don't understand it well enough to say more. See https://github.com/code4lib/ruby-oai/issues/67
See https://github.com/code4lib/ruby-oai/blob/master/lib/oai/provider.rb#L266C30-L266C76
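Independent of how ruby-oai handles it, the protocol-level behavior of sets is small enough to sketch. The Python below is not the ruby-oai API. It only illustrates the OAI-PMH rule that a `setSpec` is a colon-separated path, so a record in `a:b` is also a member of `a`. The `Record` class, the function names, and the set names `unpaywall` and `core` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    set_specs: list = field(default_factory=list)


def in_set(record, requested):
    """True if the record is in the requested set or any of its sub-sets."""
    return any(
        spec == requested or spec.startswith(requested + ":")
        for spec in record.set_specs
    )


def list_identifiers(records, requested_set=None):
    """Filter identifiers the way a ListIdentifiers request with `set` would."""
    return [
        r.identifier
        for r in records
        if requested_set is None or in_set(r, requested_set)
    ]


records = [
    Record("example:1", ["unpaywall:accepted"]),
    Record("example:2", ["core"]),
    Record("example:3", ["unpaywall", "core"]),
    Record("example:4", []),
]
unpaywall_ids = list_identifiers(records, "unpaywall")
```

One set per harvesting service is only one possible design; it presumes the items needed for each service can be identified up front.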
@amyehodge I'm closing this as the technical analysis is complete. I'd suggest that metadata analysis is still required. If you disagree on closing, feel free to re-open.
That's fine @justinlittman . I haven't had a chance to look at this in detail yet, but I'll try to sort out next steps to move this along. Thanks.
Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.
https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ
I just checked on this note and Unpaywall did respond to say that they don't support collecting data from sitemaps. Thanks for checking @edsu
In particular, we want to test our implementation to make sure content is getting picked up by Unpaywall, which uses OAI-PMH. Rochelle mentioned that "Unpaywall feeds OA publications to Web of Science and Scopus, among others, so having a connection here would make a big impact."
We also want our OAI-PMH implementation to meet the requirements for CORE (https://core.ac.uk). These requirements can be found at https://docs.google.com/document/d/1sc8RSAhJT4kmYxUKgKvNbPSegejosr83VIqOo6CR7mY/edit.
This is some information I collected while researching requirements for Unpaywall that may be useful. The idea was to get them to pick up the OA versions of published articles that we have in SDR. Note that we may not actually have any items that have the necessary metadata, but we may be able to create/enhance a few sample items to have something for testing purposes.
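For testing an endpoint before asking Unpaywall or CORE to harvest it, a small script can imitate what a harvester does. The Python sketch below builds `ListRecords` request URLs and extracts the `resumptionToken` from a response. The base URL is a placeholder and the function names are assumptions. The verb and argument names (`metadataPrefix`, `from`, `set`, `resumptionToken`) come from the OAI-PMH 2.0 protocol, in which `resumptionToken` is an exclusive argument.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken must be sent without any other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumptionToken from a response, or None on the last page."""
    token = ET.fromstring(response_xml).find(f".//{{{OAI_NS}}}resumptionToken")
    return token.text if token is not None and token.text else None


# Placeholder endpoint; the real base URL is not known from this thread.
first_url = list_records_url("https://example.org/oai", from_date="2023-01-01")
```

A full test harvest would fetch `first_url`, then keep requesting `list_records_url(base, resumption_token=...)` until `next_token` returns `None`.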