pulibrary / lib_jobs

Enterprise Services batch processing tasks. Rails 7 Ruby 3.1.0

Archive POD Dataset for Download #395

Closed kevinreiss closed 1 year ago

kevinreiss commented 1 year ago

As a Library Community Member I would like to download a compressed version of Princeton's publicly shareable MARC data for analysis and potential re-use.

Concrete example: I would like to download a set of MARC records in order to analyze the Princeton collection's coverage in certain subject areas.

Notes: We currently have a link to our bibdata dump files on this page: https://catalog.princeton.edu/dataset. That link should be replaced with a link to a location where you can download the compressed POD data set. Ideally we'd also display the date the dump was generated and refresh the data every month.

Questions

  1. Can a task "start" a dump of our POD dataset in Alma?
  2. Do we need to create a new publishing profile outside of the existing POD one to accomplish this?

maxkadel commented 1 year ago

What is the urgency of this ticket? Are there people waiting for this data currently, or is it more of a "nice-to-have"?

What kind of user experience is expected for this issue? Would this require new UI work? What kind of technical expertise do we expect a user to need to have to use this feature? See also https://github.com/pulibrary/orangelight/issues/2825

kevinreiss commented 1 year ago

I think this falls into the due diligence category of work. We'd like to support an easy way to access this data for folks who might want to use our Open Data. The best and most logical way for folks to get this data consistently would be from a public interface provided by the POD project directly, but I don't think the decision to make that open is happening any time soon. I think the goal for us, in order to close OL#2825 and this ticket, would be to have a stable URL that someone could be pointed to, where they can download a compressed set of our most recent full dump of data in MARCXML.
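As a rough illustration of the "compressed set of our most recent full dump" idea, here is a minimal Ruby sketch that gzips a MARCXML file and puts the generation date in the archive name, so a download page could show when the dump was made. The method name `archive_dump`, the `pod_dump_` file naming, and the paths are all hypothetical and not taken from lib_jobs.

```ruby
require 'zlib'
require 'date'

# Hypothetical helper (not part of lib_jobs): gzip a MARCXML dump and include
# the generation date in the archive name so it can be displayed to users.
def archive_dump(source_path, output_dir, generated_on: Date.today)
  archive_path = File.join(output_dir, "pod_dump_#{generated_on.iso8601}.xml.gz")
  Zlib::GzipWriter.open(archive_path) do |gz|
    File.open(source_path, 'rb') do |source|
      # Stream in 1 MB chunks so a large dump is never fully loaded into memory.
      while (chunk = source.read(1024 * 1024))
        gz.write(chunk)
      end
    end
  end
  archive_path
end
```

A monthly job could call something like `archive_dump('full_dump.xml', 'public/datasets')` and point the stable URL at the newest archive; how the file is actually published is left open here.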

christinach commented 1 year ago

We discussed this at our stand-up on 11/17/2022 and agreed to work on this ticket after Thanksgiving.

christinach commented 1 year ago

@kevinreiss will discuss with @escowles to have one archived full dump in a different page in lib_jobs or via the POD project page.

kevinreiss commented 1 year ago

No plans in the foreseeable future to open the POD project APIs to data consumers outside of the POD project.

kevinreiss commented 1 year ago

Discuss possibly using the POD API to grab a set of the current MARC data we are exposing as "Open Data". Two options in the API:

  1. https://github.com/pod4lib/aggregator/wiki/Harvesting-using-ResourceSync
  2. https://github.com/pod4lib/aggregator/wiki/Harvesting-using-OAI-PMH

One of these approaches could be an alternative to working with a full dump event of the current POD publishing process or creating a new publishing process.
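For the OAI-PMH option, a harvest is a series of `ListRecords` requests chained together by resumption tokens. The sketch below shows only that request-chaining logic in Ruby. The base URL is a placeholder, the `marc21` metadata prefix is an assumption, and the method names are hypothetical; the real endpoint, set names, and authentication are described in the aggregator wiki linked above.

```ruby
require 'uri'

# Placeholder endpoint, NOT the real POD URL; see the aggregator wiki.
OAI_BASE_URL = 'https://pod.example.org/oai'

# Build a ListRecords request URL. In OAI-PMH 2.0, resumptionToken is an
# exclusive argument: a follow-up request carries only the verb and the token.
def list_records_url(base_url, metadata_prefix: 'marc21', set: nil, resumption_token: nil)
  params = { 'verb' => 'ListRecords' }
  if resumption_token
    params['resumptionToken'] = resumption_token
  else
    params['metadataPrefix'] = metadata_prefix
    params['set'] = set if set
  end
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Return the token for the next page, or nil when the list is complete
# (an empty or self-closing resumptionToken element marks the last page).
# A regex keeps the sketch dependency-free; a real harvester should use an
# XML parser, which would also handle entity-escaped tokens.
def next_resumption_token(response_body)
  response_body[%r{<resumptionToken[^>]*>([^<]+)</resumptionToken>}, 1]
end
```

A harvester would fetch `list_records_url(OAI_BASE_URL)`, save the records, and keep requesting `list_records_url(OAI_BASE_URL, resumption_token: token)` until `next_resumption_token` returns `nil`.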

kevinreiss commented 1 year ago
maxkadel commented 1 year ago

Jane and I created a decision document to try to come to some conclusions on some of the questions that have come up around this issue. Contact Max or Jane if you cannot edit and want to.

sandbergja commented 1 year ago

This work is complete! @kevinreiss will open a new issue for refreshing the data regularly.