internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Multiple manifestations of single edition handled poorly #2303

Closed tfmorris closed 3 months ago

tfmorris commented 5 years ago

Description

The edition below was scanned twice by the Internet Archive, which is arguably redundant and wasteful of resources, but given that they do it, we should be able to handle it. ImportBot added a second IA identifier to the edition record, but the edition page only displays the first identifier, making the second scan invisible.

Relevant url?

https://openlibrary.org/books/OL24656394M/L'%C3%A9volution_du_dogme
https://openlibrary.org/people/ImportBot

Expectation

We either need to be able to handle multiple IA scans per edition record or create duplicate edition records with one scan per record. The latter is singularly unattractive from a metadata point of view, but handling the former may require adjusting how IA does "loans" for books which are still in copyright since currently availability and loans are tied to edition records.

Proposal & Constraints

It's more work, but I'd lean towards adding the concept of "copies" to our metadata model with multiple copies per edition.

Stakeholders

@mekarpeles @hornc

tfmorris commented 5 years ago

This was discovered while reviewing the duplicates in https://github.com/internetarchive/openlibrary/issues/1620#issuecomment-521285745

cdrini commented 5 years ago

I agree; I think the best long-term solution would be to have a new type, maybe Copy (or Scan or Digitization), and then have a one-to-many relationship from Edition to Copy. Then each IA id would be associated with a single Copy.
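A minimal sketch of what that one-to-many relationship could look like, assuming plain Python dataclasses rather than Open Library's actual schema. The `Copy` type, the `copies` field, and `add_copy` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    """One digitization of an edition (hypothetical type, not in the current model)."""
    ia_id: str  # a single archive.org identifier


@dataclass
class Edition:
    key: str
    title: str
    # One-to-many: an Edition owns any number of Copy records.
    copies: list[Copy] = field(default_factory=list)

    def add_copy(self, ia_id: str) -> Copy:
        """Attach a scan, reusing the existing Copy if this identifier is already linked."""
        for copy in self.copies:
            if copy.ia_id == ia_id:
                return copy
        copy = Copy(ia_id)
        self.copies.append(copy)
        return copy
```

With this shape, an import that finds a second scan of the same edition would call `add_copy` instead of appending an identifier that the edition page never shows.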

LeadSongDog commented 5 years ago

So your "copy" would be approximately equivalent to frbr:item, where "edition" is approximately frbr:manifestation? Bringing something of the sort into use would be helpful. It would be useful to determine the referent level for each common identifier type: does it pertain to a single item/copy, or to all similar items/copies (the manifestation)?

hornc commented 4 years ago

I'm not sure this is an import issue, and not a bug. It looks like the importer did the correct thing here by associating a newly discovered scan with the correct OL record.

I don't think creating a new edition, or new class of thing, is correct. If anything, there is potentially a UI / UX change required as a new feature, but a supporting use-case would help.

From a metadata POV, this seems like the correct behaviour. Source records is a list, and all sources are listed. We only provide one link out to a scan with the ocaid field, and we've got one. Either should be a 'good' copy, and there isn't currently an automated way to choose a 'best' one. This example is a public domain scan, so there isn't an availability problem. There might be for borrowable items, but archive.org already has this problem.

tfmorris commented 4 years ago

The example is a public domain scan, but this applies to copyrighted scans that Internet Archive is going to "lend" as well. If they've scanned multiple editions, users can join the wait list for any of them (actually they have to do it individually), but if the same edition is scanned twice, only one of the copies can be lent out through OpenLibrary.

@hornc Are you proposing that we just refuse to support lending multiple scans of the same edition?

LeadSongDog commented 3 years ago

Multiple scans of a single physical copy also represent a way to reduce OCR errors, but that’s an IA process, not OL.

LeadSongDog commented 3 months ago

It seems that whilst an edition can link multiple IA “source record”s, only one will be displayed as the ocaid. It isn’t obvious how the selection of the one to display is made.

compare https://openlibrary.org/books/OL7132707M/Principles_of_microbiology?v=8 to https://openlibrary.org/books/OL7132707M/Principles_of_microbiology.json?v=8
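The comparison can be sketched as a small helper over the JSON form of an edition. This assumes the record has been loaded into a dict and that IA-sourced entries in `source_records` carry the `ia:` prefix; the function name and the sample record are hypothetical.

```python
def hidden_ia_ids(edition: dict) -> list[str]:
    """Return ia: source records that are not the edition's single ocaid."""
    ocaid = edition.get("ocaid")
    ia_ids = [
        record[len("ia:"):]
        for record in edition.get("source_records", [])
        if record.startswith("ia:")
    ]
    return [ia_id for ia_id in ia_ids if ia_id != ocaid]


# Hypothetical record shaped like an edition's .json output.
sample = {
    "ocaid": "example_scan_one",
    "source_records": [
        "ia:example_scan_one",
        "marc:example_file",
        "ia:example_scan_two",
    ],
}
print(hidden_ia_ids(sample))  # ['example_scan_two']
```

A non-empty result means the record knows about IA identifiers that the edition page does not link to.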

hornc commented 3 months ago

This issue has changed title in a way that changes the meaning. The original was about copies of the same manifestation (FRBR: "items", which was clarified further up); now the title states "manifestation".

There are possibly also problems with handling manifestations poorly, but that has not been made clear.

I stand by my comment of Feb 28, 2020. Technical details are prompting offers of model changes and additions while entirely skipping discussion of what usecases are and aren't supported, and the reasons behind any decision.

There seems to be an (understandable) conflation of two different fields, but they have slightly different meanings, and quite different purposes:

The usecase

  1. "As an OL patron, in order to access the book I am browsing for, I want the ability to borrow a digital copy from the record I locate."

is met by the ocaid single value field and borrow button.

Also,

  2. "As an OL / archive.org librarian, in order to cross-check metadata or link other scans by work, I want to be able to match OL edition records with archive.org (and other) metadata records."

is met by simply having the multiple values in the source_records field. The fact the UI doesn't display them doesn't really matter. Data dumps, JSON API, and search use them, so they fulfill their purpose for the users that require it.

What isn't met is this:

  3. "As an Open Librarian who cares about efficient utilization of available copies (sourced from archive.org specifically?), I want the record availability button to reflect all possible copies we know about."

This would suggest expanding the ocaid field to a list, or perhaps making more use of source_records if they are IA sourced, if 3. is the desired goal.

There is likely an OL patron centric version of this too, but it's not clear that unavailable books with suitable borrowable alternative copies are a widespread problem being reported. It might be, and that could have a different set of solutions. The current phrasing seems more about hypotheticals.

Case 2 could be expanded by making all the source records more visible, or more sensibly laid out on the UI, because basically the UI doesn't support this usecase at all. But JSON and APIs etc. seem sufficient, and it's not clear that is a complaint.

Case 3 doesn't seem to be a priority either, because case 1 covers the basic functionality.

If someone were to pursue 3, it doesn't necessarily require a new modeled class of object, which was the suggestion that came before a clear statement of the problem.
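As an illustration of pursuing 3 without a new modeled class, here is a rough sketch that treats the ocaid plus any `ia:` prefixed source_records as candidate copies. The `is_available` callable is a hypothetical stand-in for whatever availability check is actually used; nothing here reflects existing Open Library code.

```python
from typing import Callable


def any_copy_available(edition: dict, is_available: Callable[[str], bool]) -> bool:
    """True if the ocaid or any ia: source record is reported as available."""
    candidates: list[str] = []
    if edition.get("ocaid"):
        candidates.append(edition["ocaid"])
    for record in edition.get("source_records", []):
        if record.startswith("ia:"):
            ia_id = record[len("ia:"):]
            if ia_id not in candidates:  # skip the ocaid if it is also a source record
                candidates.append(ia_id)
    return any(is_available(ia_id) for ia_id in candidates)
```

The ocaid is checked first, so case 1 behaves as it does today and the extra identifiers only matter when that copy is unavailable.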

LeadSongDog commented 3 months ago

@hornc Thank you for the careful explication; however, it seems to miss something: each of the source records represents a distinct scan and has distinct catalogue metadata. Even supposing the source records correctly describe the same edition, obscuring some of the source records from view means the additional metadata is not reliably and verifiably reflected in the corresponding edition record. Further, there is no assurance that the one source record used as the OCAID is in any way the best of those available to choose from.

hornc commented 3 months ago

Well I'm saying the source_records in general do not represent scans, distinct or otherwise, and there is no guarantee the catalogue metadata they do represent are distinct. It might be distinct, it might be identical, it might be the same record at different snapshots in time. It might be the same MARC record hosted by different organisations.

ia: prefixed source_records happen to be distinct archive.org identifiers and will have distinct scans, but that's not the purpose of the field. There are likely more non-scan source records than scan ones.

The only claim you can safely make about a source_record value is that the metadata associated with it represents the same "edition" as the Open Library record. That's the intent at least. It was either matched, or used to build the OL record.

Your point "obscuring some of the source records from view means the additional metadata is not reliably and verifiably reflected in the corresponding edition record" is an argument that usecase 2 isn't being fulfilled properly. I don't disagree, but I personally don't find it blocking because I know how to get that data via the raw JSON when I need it, and that's the main way I interact with it anyway. Is fixing that sufficient to resolve this issue?

"There is no assurance that the one source record used as the OCAID is in any way the best of those available to choose from."

is trivially true because there is no assurance that any value of OCAID is the best available. The current solution is to edit in a better one if it is known.

Choosing the best is still a subjective manual edit, nothing is preventing it currently as far as I know.

If I have a point here, it's that this issue isn't clear about what new feature or usecase is being requested, or about what current feature or usecase isn't working.

I've offered a few usecases I think are relevant to help clarify or narrow in on what usefully could be addressed here, if anything. There are likely others which could be suggested.

I don't feel there is an agreement or even shared understanding of what the problem is. Different commentators seem to have different impressions of what this means. The title can't even stay consistent over time. Sure, things could be done better, but without a clear statement of what and why, they don't tend to move. Tying issues or feature requests back to some clear statement of value (usecases are great for this) helps clarify for everyone and allows potential solutions to be evaluated. Many issues here suffer from the same problem; that's why there are over 700 still open.

LeadSongDog commented 3 months ago

I was not clear enough, sorry. Disregarding the issue title for a moment, the issue description initially raised by @tfmorris identifies the problem. Quite simply, the edition page fails to show all the information present in the edition record. That failure means that any error detection and correction can only be done by those few people who look at the obscure JSON, rather than by the many who look at the edition page.

The provision of a clickable link to those other source records would enable the rest of us (patrons and ordinary contributors) to see those other source records and to utilize both the metadata and the scanned images to which those links resolve. This would be a step in the right direction, but all the other aspects of the record should eventually be made visible on the edition page too.
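A minimal sketch of how such links might be derived, assuming only that `ia:` prefixed source records resolve under `https://archive.org/details/`. The function name is hypothetical, and records with other prefixes are left unlinked because no URL scheme for them is assumed here.

```python
from typing import Optional

IA_DETAILS_URL = "https://archive.org/details/"


def source_record_links(source_records: list[str]) -> list[tuple[str, Optional[str]]]:
    """Pair each source record with a URL when one can be derived, else None."""
    links = []
    for record in source_records:
        if record.startswith("ia:"):
            url = IA_DETAILS_URL + record[len("ia:"):]
        else:
            url = None  # non-IA records: no link target assumed in this sketch
        links.append((record, url))
    return links
```

An edition page template could then render every record, showing the ones with a URL as links and the rest as plain text.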