Some background: The curation team currently uses Airtable to manage the review and curation of content deposited to ScholarSphere. The Airtable is fed by a GitHub Action (hosted here) that is scheduled to run each day:
It looks for submissions deposited in the last 60 days and creates/updates entries in Airtable.
Additional metadata that is useful for curation but not available in the catalog (the depositor name and id) is retrieved through the GraphQL endpoint: https://scholarsphere.psu.edu/api/public
This system mostly works, but there are some problems:
The catalog feed doesn't include embargoed works, so these items don't end up in Airtable (and may not be reviewed).
The only timestamp available in the catalog (deposited_at_dts) corresponds to the date a work's first draft version was created (not when it was published). Because only published items appear in the catalog, we miss works that were published more than 60 days after the first version was created.
There is no mechanism to receive notification about new published versions.
These APIs aren't really "supported" (i.e., documented, with tests), so the workflow could easily break.
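The timestamp problem above can be illustrated with a small sketch. It assumes the daily job filters on a 60-day lookback window; the `in_window` helper and the example work are hypothetical and are not taken from the actual GitHub Action.

```python
from datetime import datetime, timedelta, timezone

# Assumed lookback window, matching the 60 days described above.
LOOKBACK = timedelta(days=60)


def in_window(timestamp, now, lookback=LOOKBACK):
    """Return True if timestamp falls within the lookback window ending at now."""
    return now - lookback <= timestamp <= now


# Fixed "now" so the example is reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

# Hypothetical work: first draft created 90 days ago, published yesterday.
work = {
    "deposited_at": now - timedelta(days=90),
    "published_at": now - timedelta(days=1),
}

# Filtering on the draft-creation date skips the work entirely ...
missed_by_current_filter = not in_window(work["deposited_at"], now)
# ... while filtering on the publication date would pick it up.
caught_by_published_filter = in_window(work["published_at"], now)
```

In this sketch the work never shows up in the current feed, even though it was published within the last day.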
An ideal curation "feed" would:
Include embargoed works
Include the following metadata: id, title, depositor id & name, depositing application id & name.
Include and be sortable by the recently added published_at timestamp. (Each time a work version is published, the work should move to the top of the feed.)
Be documented and tested to reduce the likelihood of breaking changes to the integration.
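As a rough sketch of what such a feed could look like, the example below models one entry and the sort order. The `FeedEntry` class, its field names, and `build_feed` are assumptions made for illustration, not an existing ScholarSphere API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FeedEntry:
    # Field names are hypothetical; they mirror the metadata listed above.
    id: str
    title: str
    depositor_id: str
    depositor_name: str
    application_id: str
    application_name: str
    published_at: datetime
    embargoed: bool = False


def build_feed(entries):
    """Return all entries (embargoed included), most recently published first."""
    return sorted(entries, key=lambda e: e.published_at, reverse=True)


entries = [
    FeedEntry("work-a", "Dataset A", "abc123", "A. Depositor", "app-1", "Example App",
              datetime(2024, 1, 1, tzinfo=timezone.utc)),
    FeedEntry("work-b", "Thesis B", "xyz789", "B. Depositor", "app-1", "Example App",
              datetime(2024, 2, 1, tzinfo=timezone.utc), embargoed=True),
]

feed_ids = [e.id for e in build_feed(entries)]

# Publishing a new version of work-a updates its published_at,
# which moves it back to the top of the feed.
entries[0].published_at = datetime(2024, 3, 1, tzinfo=timezone.utc)
feed_ids_after = [e.id for e in build_feed(entries)]
```

Sorting on published_at alone is enough to get the "new version moves to the top" behavior, provided the timestamp is updated each time a version is published.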
Some questions and possible solutions:
Collections don't have the published_at timestamp, so how are they included in the feed? One possibility is to simply not include them in the feed since curation mostly focuses on the underlying works. Alternatively, we can present a collection's updated_at as its published_at.
Should we change the existing catalog (JSON feed) or create a new feed just for curation? I think it would be OK (an improvement, actually) to include embargoed works and change the default sorting to published_at for the catalog as a whole. However, removing collections from the catalog would be bad.
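The second option for collections (presenting updated_at as published_at) could be sketched as a simple fallback. The dictionary shape and the `effective_published_at` helper are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timezone


def effective_published_at(resource):
    """Return published_at when present; otherwise fall back to updated_at."""
    published = resource.get("published_at")
    return published if published is not None else resource["updated_at"]


# Hypothetical records: a work has published_at, a collection does not.
work = {
    "id": "work-a",
    "published_at": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "updated_at": datetime(2024, 3, 5, tzinfo=timezone.utc),
}
collection = {
    "id": "collection-x",
    "updated_at": datetime(2024, 4, 1, tzinfo=timezone.utc),
}

# Works and collections can then share one sort key in the feed.
feed = sorted([work, collection], key=effective_published_at, reverse=True)
feed_ids = [r["id"] for r in feed]
```

With this fallback, collections stay in the catalog and sort alongside works without needing a new timestamp of their own.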