internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.22k stars 1.37k forks

Create LibriVox import API method #8105

Open cdrini opened 1 year ago

cdrini commented 1 year ago
hornc commented 1 year ago

@cdrini We should be able to do this from the archive.org item where it's hosted.

I recently added LibriVox ids to archive.org items (in the LibriVox uploader tool), and have a tool to generate MARC records from those items which I have been using to generate a KBART resource file. See InternetArchive_Global_LibriVox_2023-05-29.zip on https://archive.org/download/internetarchive-kbart

It's not quite fully automated yet (at least, it's not regularly scheduled to catch new additions), and I should probably upload the MARC records to the items.

This sounds like I should include an automatic import to OL in that process.

I don't think we need a new LibriVox endpoint. Instead, we can use the existing MARC-from-archive.org import process to get the same result, and the whole process should be regularly scheduled as part of a LibriVox / archive.org update process.

cdrini commented 1 year ago

Oh ty for adding the librivox ID to IA!! That was a huge source of headaches when I've investigated this in the past and will make a bulk import significantly easier! Do all IA Librivox Items have the call number set?

LibriVox will likely be the first example of a few different Trusted Book Providers that will have a custom import flow. Others include places like Wikisource, Standard Ebooks, OpenStax, etc. Just like we have an /import/ia endpoint to import from IA, I want to start generalizing to have /import/{book_provider} so that we can import from more places that provide books!
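To make the `/import/{book_provider}` idea concrete, here is a minimal sketch of how one generic route could dispatch to per-provider handlers. Everything in it is hypothetical and for illustration only: the `PROVIDERS` registry, the `register_provider` decorator, the handler functions, and the `librivox:` source record prefix are assumptions, not existing Open Library code.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a provider name to its import handler.
PROVIDERS: Dict[str, Callable[[str], dict]] = {}


def register_provider(name: str):
    """Register a handler under the name used in /import/{book_provider}."""
    def decorator(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        PROVIDERS[name] = fn
        return fn
    return decorator


@register_provider("ia")
def import_from_ia(identifier: str) -> dict:
    # Placeholder: a real handler would fetch and convert the IA item's metadata.
    return {"source_records": [f"ia:{identifier}"]}


@register_provider("librivox")
def import_from_librivox(identifier: str) -> dict:
    # Placeholder: the "librivox:" prefix is an assumption, not a confirmed format.
    return {"source_records": [f"librivox:{identifier}"]}


def handle_import(path: str) -> dict:
    """Dispatch a path such as /import/librivox/1234 to the matching handler."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "import":
        raise ValueError(f"Not an import path: {path!r}")
    _, provider, identifier = parts
    if provider not in PROVIDERS:
        raise LookupError(f"Unknown book provider: {provider!r}")
    return PROVIDERS[provider](identifier)
```

With this shape, adding Wikisource or Standard Ebooks would only mean registering one more handler; the routing itself would not need to change.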

I'd like to keep this flow from depending too heavily on IA (since a lot of Trusted Book Providers won't have IA records) or on MARC (since MARC records can be a little complicated to work with and understand, especially for OL contributors).

Further in the future we'll schedule automatic update flows from any generic book provider to OL. It is unfortunately very similar to some of the work for adding records to IA, but it has to be different since OL can import things/data IA doesn't want.

cdrini commented 1 year ago

To create that KBART file and get that data, did you crawl the LibriVox site, @hornc? Or did you find a LibriVox data dump?

hornc commented 1 year ago

@cdrini the KBART is generated from archive.org metadata only. It represents archive.org holdings, and archive.org hosts all of the LibriVox audiobooks.

The lack of a LibriVox id made it very hard to automate anything, which is why I added it. I filled it in for all existing LibriVox audiobooks, so the IA items should all point back to the LibriVox metadata.
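As a rough illustration of working from archive.org metadata only, the sketch below builds a query URL for archive.org's advanced search endpoint to page through LibriVox items. It is an assumption about how such a listing could be done, not the tool described above: the `librivoxaudio` collection name and the requested fields are illustrative, and the metadata field that holds the LibriVox id is deliberately left out because this thread does not name it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(rows: int = 100, page: int = 1) -> str:
    """Build a search URL listing items in the (assumed) LibriVox collection."""
    params = [
        ("q", "collection:librivoxaudio"),  # assumed collection identifier
        ("fl[]", "identifier"),             # fields to return for each item
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("page", str(page)),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Usage: build the URL for the second page of 10 results and inspect its query.
url = build_search_url(rows=10, page=2)
query = parse_qs(urlparse(url).query)
print(url)
```

The sketch stops at building the URL; fetching it and turning each result into an import record would be the job of the scheduled process discussed earlier in the thread.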

JohannSuarez commented 1 year ago

Hello! I'd like to take a crack at this issue. I'll be keeping in touch with cdrini.

github-actions[bot] commented 9 months ago

Assignees removed automatically after 14 days.