internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Data Cleanup: Populate blank publisher fields from ISBN prefix when known #2119

Closed LeadSongDog closed 1 month ago

LeadSongDog commented 5 years ago

Many edition records have no publisher shown, but do have an ISBN. Previous discussion at #895 shows how to get an official spelling of the publisher from the ISBN.

hornc commented 5 years ago

To give an idea of the scope of this:

In the May 2019 edition dump, running

grep -cv '"publishers":' ol_dump_editions_2019-05-31.txt

shows 1,189,309 editions without publishers. 169,759 of those have ISBNs.
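For anyone wanting both counts in one pass, here is a minimal Python sketch. It assumes each dump line is tab-separated with the JSON record in the last column, and that ISBNs live in `isbn_10` / `isbn_13` lists; the function name is made up for illustration. Unlike the grep above, it also treats an empty `publishers` list as missing, so its totals may differ slightly.

```python
import json


def count_missing_publishers(lines):
    """Return (editions without publishers, how many of those have an ISBN)."""
    missing = 0
    missing_with_isbn = 0
    for line in lines:
        # Assumed layout: type, key, revision, last_modified, JSON record.
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        if record.get("publishers"):
            continue  # publisher already present
        missing += 1
        if record.get("isbn_10") or record.get("isbn_13"):
            missing_with_isbn += 1
    return missing, missing_with_isbn
```

Usage would be something like `with open("ol_dump_editions_2019-05-31.txt") as f: print(count_missing_publishers(f))`.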

LeadSongDog commented 1 month ago

So #895 just closed without changing code, but the underlying idea here, to exploit ISBN prefixes to fill in blank publisher fields, still has potential to quickly improve data. Can we revisit it? I see no reason why the prefixes could not become readily searchable, yielding a standard spelling for the publisher, and even an approximate year of publication (as adjacent ISBNs will usually be assigned in the same year).

scottbarnes commented 1 month ago

@LeadSongDog, can you tell me a bit more about the strategy here?

I tried the links in #895, but they no longer seem to work.

Using https://openlibrary.org/books/OL3697910M/General_chemistry as an example, it has ISBN 13 9780618399413, but, to ask a stupid question, how do I determine the prefix? I tried querying for 978-0-618 and saw a lot of publishers, some of which match the already listed publisher, and I see that not every prefix is the same length: https://grp.isbn-international.org/search/piid_solr?keys=978-0-618+%28ISBNPrefix%29.

My goal is to understand the process so the issue can be better broken down into steps that can be used to close the issue.

tfmorris commented 1 month ago

To update the five-year-old numbers above (https://github.com/internetarchive/openlibrary/issues/2119#issuecomment-505781155): there are currently 2,281,048 (2.3M) editions without publishers, nearly double the number from 5 years ago, and 513,197 (0.5M) of those have ISBNs.

LeadSongDog commented 1 month ago

Only the vaguest ideas on implementation, but…

One might start from an edition record that has an ISBN but no publisher identified. Searching on minimally truncated versions of the ISBN returns something like these:

https://openlibrary.org/search?q=isbn%3A+97806183994*&mode=everything or https://openlibrary.org/search?q=isbn%3A+9780618399*&mode=everything or https://openlibrary.org/search?q=isbn%3A+978061839*&mode=everything

A quick comparison of the results shows that the closest ISBNs had the most similarities, even revealing variant spellings for the publisher and authors. The shortest (first) list above includes several spellings for Houghton Mifflin seen at Q390074.

It might be simpler to start from a dump of editions, then sort on isbn?
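A rough Python sketch of the "sort on ISBN" idea, for illustration only: given a sorted list of (ISBN-13, publisher) pairs for editions that already have a publisher, it finds the entries adjacent to a publisher-less ISBN and reports how many leading digits they share. The function name and the sample values in the usage note are hypothetical, and whether neighbouring ISBNs really share a publisher is exactly the open question here.

```python
from bisect import bisect_left
from os.path import commonprefix


def nearest_publishers(isbn, known):
    """Return (shared_digits, publisher) for the ISBNs adjacent to `isbn`.

    `known` must be a sorted list of (isbn13, publisher) tuples.
    Results are ordered with the longest shared prefix first.
    """
    i = bisect_left(known, (isbn, ""))
    # The entry just before the insertion point and the one at it.
    neighbours = known[max(i - 1, 0): i + 1]
    return sorted(
        ((len(commonprefix([isbn, other])), pub) for other, pub in neighbours),
        reverse=True,
    )
```

With made-up data such as `known = [("9780618399406", "Publisher A"), ("9780618400003", "Publisher B")]`, calling `nearest_publishers("9780618399413", known)` ranks "Publisher A" first because it shares 11 leading digits against 7.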

scottbarnes commented 1 month ago

I know I am being a bit slow here, but the part that isn't fully clear to me is how to get from the ISBN to the publisher, or at least to the prefix. I think we might need to be able to do that on a large scale to do it on the data dump.

hornc commented 1 month ago

@scottbarnes, to answer 'how to get the prefix', you can use isbnlib in Python:

>>> import isbnlib
>>> isbnlib.mask('9780618399413')
'978-0-618-39941-3'

and take all but the last two groups, which here gives 978-0-618.

The prefixes are assigned by a registry, and the ranges are updated every so often. isbnlib keeps these relatively up-to-date.
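The "all but the last two groups" step can be sketched in plain Python, taking an already hyphenated ISBN such as the `isbnlib.mask` output shown above. The helper name `registrant_prefix` is hypothetical and not part of isbnlib.

```python
def registrant_prefix(masked_isbn):
    """Drop the last two hyphen-separated groups of a hyphenated ISBN.

    The last two groups are the publication element and the check digit;
    what remains is the prefix assigned to the registrant.
    """
    groups = masked_isbn.split("-")
    if len(groups) < 3:
        raise ValueError(f"not a hyphenated ISBN: {masked_isbn!r}")
    return "-".join(groups[:-2])


print(registrant_prefix("978-0-618-39941-3"))  # 978-0-618
```

The result is only as good as the hyphenation it is given, so it depends on the range data being current.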

Your example for 978-0-618 is interesting. I'm surprised that it returns both "Clarion Books" (a children's book publisher) and the more correct-looking Houghton Mifflin.

I was going to say that doesn't make sense to me, unless it is different imprint levels that are owned by the same parent company, but that might be what is happening here. Clarion Books is owned by Harper Collins, and Harper Collins has bought Houghton Mifflin at some point, so maybe that range has been transferred? This might make it more difficult to extract the original publisher (because they keep eating each other).

It seemed like a reasonable approach to extract publishers from ISBN prefixes, but your example seems to show that this can change over time. Even without this complicating factor, an ISBN prefix lookup might give results at a different imprint level, which may not be that useful for someone searching for bibliographic publisher metadata.

That leads to the question: what use-case does populating publisher from ISBN serve?

OL doesn't use publisher to disambiguate between books, primarily because publisher is just an uncontrolled string, and there are so many variations in forms and imprints that it doesn't add much. Any item with an ISBN already has that as a less ambiguous ID for any kind of lookup.

The publisher string determined by an ISBN lookup is quite likely not to appear on the book in that form.

LeadSongDog commented 1 month ago

Particularly for textbooks, some work titles, such as « Chemistry » or « Calculus », are reused by multiple works. To merge the work records, they must be disambiguated too. By identifying the publisher of an edition, it is often possible to (A) determine more completely the author(s) for each work (as cataloguers and online merchants often list only author surnames) and (B) determine which identically titled work the edition belongs to. Work-merging will still require the merged work records to agree on linked authors.

tfmorris commented 1 month ago

Like @hornc, I'm suspicious of this approach. It doesn't seem like a reliable way to source metadata. The original problem (no publisher stated) is created by using poor quality metadata to start with, so let's not compound it.

To take a random example from early in a recent edition dump: https://openlibrary.org/books/OL11812091M

It was originally imported from a threadbare Amazon page, which is where the trouble started: https://www.amazon.com/gp/product/0976511037

The ISBN prefix is registered to "University Book Exchange" (https://grp.isbn-international.org/search/piid_solr?keys=978-0-9765110), but the copyright page lists "Independent Press" (https://archive.org/details/whengreekgoatssi0000ceru/page/n3/mode/1up), the same thing that WorldCat has (https://search.worldcat.org/title/437125764?tab=details) and which appears in the MARC record that IA has associated with it: https://ia803401.us.archive.org/fetchmarc.php?path=/0/items/whengreekgoatssi0000ceru/whengreekgoatssi0000ceru_marc.xml

I'm not sure why the IA MARC record wasn't used to populate the publisher, but it strikes me that even if it weren't available, using WorldCat would be better than trying to guess from the ISBN.

LeadSongDog commented 1 month ago

@tfmorris That’s an interesting example.

Of course I agree that low quality metadata sources (Goodreads, AMZ, BwB) should not be amplified, but that still seems to be accepted OL practice. I’d prefer to simply delete what can’t be verified, but I can’t. Would you prefer to have us just leave the mess rather than clean it up?

The Promise record was attached to the edition long after the book had been scanned into IA: https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff While the scan correctly showed « Independent Press » as the publisher, the Promise record incorrectly showed the ECU bookstore « University Book Exchange ».

The problem was aggravated in that the low quality source (Promise) was allowed to overwrite the high quality source (scan). There ought to be logic preventing this.
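As a purely hypothetical sketch of such logic (not Open Library's import code, and the ranking below is an assumption made only for illustration), one could rank sources by the prefix of their source record and let an incoming value replace a populated field only when it comes from a higher-ranked source:

```python
# Hypothetical trust ranking keyed on the source_records prefix.
# Higher means more trusted. This is NOT Open Library's actual policy.
SOURCE_RANK = {"marc": 3, "ia": 3, "bwb": 1, "amazon": 1, "promise": 0}


def source_rank(source_record):
    """Rank a source record such as 'promise:bwb_daily_pallets_2021-01-26'."""
    return SOURCE_RANK.get(source_record.split(":", 1)[0], 0)


def merge_publishers(existing, existing_source, incoming, incoming_source):
    """Return the publishers list to keep after an import."""
    if not incoming:
        return existing
    if not existing:
        return incoming  # filling a blank field is always allowed
    if source_rank(incoming_source) > source_rank(existing_source):
        return incoming
    return existing  # equal or lower trust never overwrites
```

Under this assumed ranking, a Promise record carrying "University Book Exchange" would not replace "Independent Press" taken from the scan.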

scottbarnes commented 1 month ago

For my part, I am not convinced we can, with confidence, get from the ISBN to the correct publisher. But I do agree that low quality imports should not overwrite higher quality metadata, though actual overwriting of populated fields shouldn't currently be happening, and if it is, I think that is likely a bug to be addressed.

In the case of https://openlibrary.org/books/OL11812091M/When_Greek_Goats_Sing_Sad_Songs?_compare=Compare&b=5&a=4&m=diff, again I apologize for being slow, @LeadSongDog, but I should be upfront about my ignorance to save everyone time: can you help explain the possible harm from adding promise:bwb_daily_pallets_2021-01-26 to source_records, and urn:bwbsku:W1-AUZ-082 to local_id?

We may be getting far afield from the specific issue of using the ISBN prefix to populate the publisher. Unless there is a way to do this with confidence, I am inclined to close this specific issue.

However, there is still more to do in terms of improving quality. Hopefully the changes in #9587 and #9574, along with the forthcoming changes in #9753 and PRs to address #9808 and #9831 will help limit the light records. It may also be the case that the suggestion in #9808 not to match MARC imports without an ISBN to an existing edition with only title + ISBN should be extended to all imports, but that may be a discussion for elsewhere.

LeadSongDog commented 1 month ago

@scottbarnes No apology needed. We all have learning to do, me more than most. I wonder if recording the Promise pallet associated with the (then unscanned) edition has a point when the IA record from the subsequent scan shows a different “Old_pallet IA-NS-0000662”.

scottbarnes commented 1 month ago

I see that IA-NS-0000662 is listed as one of the pallets at https://archive.org/details/bwb_daily_pallets_2021-01-26, so there seems to be at least some connection. I am unsure, however, how that particular pallet, out of the several listed there, was associated with https://archive.org/details/whengreekgoatssi0000ceru.

I have found it useful at times to use the source record, e.g. bwb_daily_pallets_2021-01-26, to go look at the metadata from the promise item itself to try to understand a bit more about what happened with a particular import.

scottbarnes commented 1 month ago

Pending a way to confidently add publisher records from the ISBN prefix, I am going to close this as not (currently) planned.