internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Video Tutorial of Implementing a Trusted Book Provider (e.g. Cita Press) #8462

Open mekarpeles opened 11 months ago

mekarpeles commented 11 months ago

Background

OpenLibrary.org is a catalog of every book published and, where possible, it links to sources where you can access books to read or borrow.

Many of these books are fulfilled by the Internet Archive's book lending library program. In addition, Open Library links to many vetted partner book sources like Project Gutenberg, Librivox, Standard Ebooks, OpenStax, and others through what it calls its Trusted Book Providers (TBP) program.

We have a form where organizations can apply to be considered for the TBP program.

Currently, each TBP is added to the website manually and this involves a few steps.

  1. Writing a Python script to load the metadata from the partner into a format that can be imported into Open Library
  2. Extending the Open Library TBP code to register a new source
  3. Using the openlibrary-client with step [1] to submit these records for import
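
To make steps 1 and 3 a bit more concrete, here is a minimal sketch of what the conversion script might produce before anything is handed to openlibrary-client. Everything in it is an assumption for illustration: the partner field names (`name`, `writer`, `year`), the `example` source prefix, and the helper names are hypothetical, and the `"<provider>:<identifier>"` form for `source_records` is assumed rather than confirmed here. The actual submission call is left out on purpose, since it depends on the client's API.

```python
import json

# Hypothetical partner records; a real provider will use different field names.
PARTNER_BOOKS = [
    {"name": "Example Book", "writer": "Jane Doe", "year": "2021"},
    {"name": "Another Book", "writer": "John Roe", "year": "2019"},
]


def slugify(text):
    """Turn a title into a simple lowercase, hyphenated identifier."""
    return "-".join(text.lower().split())


def to_import_record(partner_book, source_prefix, publisher):
    """Map one partner record to a dict shaped like an import record."""
    return {
        "title": partner_book["name"],
        "authors": [{"name": partner_book["writer"]}],
        "publishers": [publisher],
        "publish_date": partner_book["year"],
        # Assumed convention: "<provider>:<identifier>"; the slug is made up.
        "source_records": [f"{source_prefix}:{slugify(partner_book['name'])}"],
    }


def to_jsonl(records):
    """Serialize records one JSON object per line, ready for a later import step."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


records = [to_import_record(b, "example", "Example Press") for b in PARTNER_BOOKS]
print(to_jsonl(records))
```

Keeping the mapping (step 1) separate from the submission (step 3) means the converted records can be inspected or checked before anything is imported.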

The program is detailed here: https://openlibrary.org/trusted-book-providers. There's also a blog post here with more details.

Describe the problem that you'd like solved

Something that would be really helpful is a video recording of implementing a Trusted Book Provider. It should only take about an hour (assuming the data is all in one place). Cita Press may be a good place to start. Once we have a single example, others should be able to add new sources relatively easily, which could make a big impact.

We have permission from Cita to pursue an integration, and all their data and books are available from citapress.org and http://citapress.org/page-data/index/page-data.json

Additional context

Here's a script which fetches the publisher's catalog and begins to map Cita Press's books (http://citapress.org/page-data/index/page-data.json) to Open Library's import schema (https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json):

import requests

# Fetch Cita Press's catalog, which the site publishes as JSON.
r = requests.get('https://citapress.org/page-data/index/page-data.json')
r.raise_for_status()  # stop early if the request failed
data = r.json()  # the parsed JSON, which we can manipulate like a dictionary
books = data['result']['data']['allMarkdownRemark']['nodes']

# We need these books to be converted into this format: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json
for book in books:
    openlibrary_book = {
        'title': book['frontmatter']['title'],
        'description': book['frontmatter']['description'],
        'cover': '???',
        'source_records': 'citapress',  # ignore, leave as is :)
        'authors': [],  # might be missing from the data!
        'publishers': [],
        'publish_date': '',
    }
    print(openlibrary_book)
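
Since the mapping above still leaves several fields empty, a small check like the following could help spot which records are not yet ready to import. This is only a sketch: the list of required fields and the list form of `source_records` are assumptions based on my reading of the import schema linked above, and `missing_fields` and the sample record are hypothetical. Please check the schema itself before relying on it.

```python
# Assumed set of required fields; verify against import.schema.json.
REQUIRED_FIELDS = ("title", "source_records", "authors", "publishers", "publish_date")


def missing_fields(record):
    """Return the required fields that are absent or empty in a mapped record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical record, shaped like the output of the loop above.
sample = {
    "title": "Example Title",
    "description": "An example description.",
    "source_records": ["citapress:example-title"],  # assumed "<provider>:<id>" form
    "authors": [],
    "publishers": [],
    "publish_date": "",
}

print(missing_fields(sample))  # ['authors', 'publishers', 'publish_date']
```

Running this over every mapped book would give a quick list of what still has to be filled in from the page data (or by hand) before step 3.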

Examples & References

Here's an example of importing...

Criteria

This issue may be closed when we have:

  1. a video recording end-to-end (hackathon style) of adding a new Trusted Book Provider
  2. a provider PR for Cita
  3. a wiki document/tutorial which explains the process, references PR examples, etc. (see: the examples listed in https://openlibrary.org/trusted-book-providers)

Stakeholders

mekarpeles commented 9 months ago

Here's the video! https://archive.org/embed/openlibrary-tour-2020/2024-trusted-book-providers-walkthrough.mp4

mekarpeles commented 9 months ago

Here's the corresponding PR https://github.com/internetarchive/openlibrary/pull/8682/files

mekarpeles commented 9 months ago

Here's documentation on the process: https://docs.google.com/document/d/1EGGayYCFWYapy8icTx97afxhmA8_mjeRPJm_TekBNLg/edit?pli=1

Billa05 commented 9 months ago

Hi @mekarpeles, I've watched the video and I now understand how everything is connected. We just need to add the import script for Cita Press to become a TBP. Since it's similar to issue #8551, I can handle the task once @cdrini approves the PR.