internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Allow librarians to import MARC data from other libraries #8360

Open onnotasler opened 11 months ago

onnotasler commented 11 months ago

When entering new books or editing existing ones, I often have to manually copy data from libraries that offer a MARC record for download. It would be great if I could import this data directly instead of having to type it.

As an example, take Das Postwesen im Postamtbezirk Buxtehude. This book exists as a really low-quality import on Open Library at OL26425107W. The Deutsche Nationalbibliothek offers most of the missing information on its website, with downloads as MARC21-XML and RDF (Turtle).
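To illustrate what reading such a download could look like, here is a minimal sketch that pulls the title (field 245, subfield a) out of a MARC21-XML record using only the Python standard library. The sample record is hand-written for illustration and is not an actual DNB export; the helper name `get_subfields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC21-XML ("MARCXML") documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample for illustration only, NOT a real DNB record.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Das Postwesen im Postamtbezirk Buxtehude</subfield>
  </datafield>
</record>"""


def get_subfields(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS) if el.text]


record = ET.fromstring(SAMPLE)
print(get_subfields(record, "245", "a"))
```

A real record would carry many more fields (authors, publisher, identifiers), and other MARC flavours use different tags, so this only covers the MARC21 case.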

The DNB is not the only national library offering this, although the formats differ between libraries. For instance, the Bibliothèque nationale de France offers Intermarc and Unimarc instead, while LIBRIS (National Library of Sweden) offers MARC21.

It would save me time and prevent spelling errors if I could import those datasets.

Describe the problem that you'd like solved

A way to import MARC records from National Libraries, to at least improve existing records, but ideally also to create new books.

Proposal & Constraints

As far as I understood, Open Library already imports MARC records from some libraries. At least I often read "imported by MARC record from library of ..." at the bottom of editions.

The import should not be more annoying than typing the data in manually. Also, there seem to be many technical differences between MARC versions. I probably won't be able to get up to speed on all of them, so this would have to be handled automatically.

Additional context

Stakeholders

@hornc

LeadSongDog commented 11 months ago

Really, this should have been addressed long ago. Once a unique external ID such as ISBN or OCLCn has been furnished, the ImportBot ought not settle for just one repository’s record, but either select the most complete one available from a reliable library, or even better, fuse them together to fill in any blank fields. Certainly not a good plan to be stuck indefinitely with whatever little bit AMZ or BWB furnished.
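As a rough sketch of the "fuse them together" idea, assuming each source record has already been parsed into a flat dictionary: start from the most complete record and fill any blank fields from the others. The function names and the sample field names and values are illustrative assumptions, not ImportBot's actual behaviour.

```python
def is_blank(value):
    """Treat None and empty strings/lists/dicts as missing."""
    return value is None or value == "" or value == [] or value == {}


def completeness(record):
    """Count how many fields in a record carry a value."""
    return sum(1 for v in record.values() if not is_blank(v))


def fuse_records(records):
    """Prefer the most complete record, then fill blanks from the rest."""
    ranked = sorted(records, key=completeness, reverse=True)
    fused = {}
    for record in ranked:
        for field, value in record.items():
            if is_blank(fused.get(field)) and not is_blank(value):
                fused[field] = value
    return fused


# Made-up records for illustration only.
sparse = {"title": "Example Title", "isbn_13": ["9780000000002"], "publishers": []}
fuller = {
    "title": "Example Title: A Subtitle",
    "publishers": ["Example Verlag"],
    "publish_date": "1987",
    "number_of_pages": None,
}
print(fuse_records([sparse, fuller]))
```

This never overwrites a value that is already present, so conflicting values between sources would still need a separate policy (for example, ranking sources by reliability).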

Koenisegg484 commented 10 months ago

Hi @hornc, I would like to work on this issue. Could I get some pointers on how to start, as this is my first contribution?

mekarpeles commented 10 months ago

It seems like the ask is: Ability to upload/submit a MARC record to Open Library

We have a pipeline for importing MARCs to Open Library, backed by Archive.org items which is described here: https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#MARC-Records

Also, there is a MARC option in the openlibrary.org/api/import path...
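For anyone wanting to experiment, a minimal sketch of preparing such a request with the standard library might look like the following. Only the endpoint path comes from this thread; the cookie name, the content type, and the helper `build_import_request` are assumptions, and the request is built but deliberately not sent.

```python
import urllib.request

# Endpoint path as mentioned in this thread.
IMPORT_URL = "https://openlibrary.org/api/import"


def build_import_request(record_bytes, session_cookie):
    """Build (but do not send) a POST carrying one raw record as the body.

    Assumptions: authentication via a cookie named "session" and a generic
    binary content type. Check the import documentation before relying on them.
    """
    return urllib.request.Request(
        IMPORT_URL,
        data=record_bytes,
        method="POST",
        headers={
            "Cookie": f"session={session_cookie}",
            "Content-Type": "application/octet-stream",
        },
    )


req = build_import_request(b"<record bytes here>", "dummy-cookie-value")
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req)
# That call is left out here on purpose.
```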

This doesn't seem like a fantastic match for a first project by a community contributor. If someone did want to work through this, the solution would likely be...

To create a librarian-only UI where a contributor in the librarian permission group can upload a MARC record, which then gets submitted to our import process using the MARC format of parse_data: https://github.com/internetarchive/openlibrary/blob/c792a2f854d0ff2912e2622f322fa597c034a1c8/openlibrary/plugins/importapi/code.py#L117-L133
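An upload UI would need to work out what kind of file it was handed before passing it on. The sketch below is a standalone illustration of content sniffing and is not the code at the link above; the function name `sniff_format` and its return labels are hypothetical. It relies on two facts: MARC21-XML uses the `http://www.loc.gov/MARC21/slim` namespace, and a binary MARC (ISO 2709) record starts with its own length as five ASCII digits.

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "{http://www.loc.gov/MARC21/slim}"


def sniff_format(data: bytes) -> str:
    """Guess the format of an uploaded record from its content."""
    head = data.lstrip()
    if head.startswith(b"<"):
        try:
            root = ET.fromstring(head)
        except ET.ParseError:
            return "unknown"
        return "marcxml" if root.tag.startswith(MARCXML_NS) else "other-xml"
    if head.startswith(b"{"):
        return "json"
    # Binary MARC: the first five bytes state the total record length.
    if len(data) >= 5 and data[:5].isdigit() and int(data[:5]) == len(data):
        return "marc-binary"
    return "unknown"


print(sniff_format(b'<record xmlns="http://www.loc.gov/MARC21/slim"/>'))
print(sniff_format(b'{"title": "Example"}'))
```

Telling MARC21 apart from Unimarc or Intermarc would need more than this, since those share the same container formats and differ in what the field tags mean.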

hornc commented 10 months ago

I agree with @mekarpeles that this is probably a bit tricky for a first time contributor.

I had been meaning to respond with a summary of the two options mentioned above where we do have MARC imports already.

The bulk import process could be used to import a single record, but that's a bit fiddly and involves creating a new archive.org item. Depending on the source though, if MARC records are available publicly, there might be a way to import an entire collection rather than a few books one by one. Is that a possibility here?

The API should work to import a single record in one go, but I have not looked at this in a while. I don't think the single import API will store the MARC record anywhere, which is less useful than it could be. Open Library does not store MARC records; they are all on archive.org, either as single records stored on a scanned item or as part of a larger bulk-data MARC collection. Single MARC records without corresponding scans are not handled well, or at all (if I remember correctly).

The workaround has been to only import bulk collections, which gives many new books, and records the source.

Three options:

  1. Use the existing bulk import API because we can get more records from this source (I don't know if that's possible or better than the original request)
  2. Figure out the existing API instructions in a way that satisfies the request. The API is there, but is mostly unused.
  3. Implement a new librarian UI for the existing APIs, if the first two options aren't sufficient as is.

onnotasler commented 10 months ago

Depending on the source though, if MARC records are available publicly, there might be a way to import an entire collection rather than a few books one by one. Is that a possibility here?

The free MARC records I found were all limited to a single edition of a single work. With the tools and knowledge I have, I can only download and process one edition at a time. If it is possible to import the whole catalogue at once, that would definitely be better.

At least the Deutsche Nationalbibliothek has a "Bezugswege und Exportformate" (access routes and export formats) entry on their homepage, and they seem to offer their whole catalogue in several different file formats:

They also offer a long list of formats and APIs, but I lack the technical expertise to comment on them.

hornc commented 10 months ago

@onnotasler There's an issue for DNB data here: https://github.com/internetarchive/openlibrary-bots/issues/29 I have prepared the data and made a start on importing. I stopped because of the various discussions about import data quality, and have not yet resumed importing. This is something I can turn back on again if there is demand.

onnotasler commented 9 months ago

I do not insist on a MARC importer if I can instead get the books imported in bulk, but in that case we should implement a way to suggest sources for bulk data instead.