Open rahulsabbineni opened 2 years ago
That’s another way of approaching #372. I suggest you look at sorting on ISBN: publishers buy ISBNs from registrars en bloc, so within a block, similar publisher name spellings could safely be consolidated.
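A rough sketch of that idea, in case it helps: group editions by a leading slice of the ISBN and list the prefixes that carry more than one publisher spelling. The fixed `PREFIX_LEN`, the function name `group_by_isbn_prefix`, and the sample records are all assumptions made for illustration. Real registrant prefixes vary in length, so a proper version would need the registrars' published range data.

```python
from collections import defaultdict

# Assumption: a fixed-length prefix stands in for the publisher's ISBN block.
# Real registrant prefixes have variable length.
PREFIX_LEN = 7


def group_by_isbn_prefix(editions, prefix_len=PREFIX_LEN):
    """Map ISBN prefix -> publisher spellings, keeping prefixes with 2+ spellings."""
    groups = defaultdict(set)
    for isbn, publisher in editions:
        # Drop hyphens and spaces so only the digits are compared.
        digits = isbn.replace("-", "").replace(" ", "")
        groups[digits[:prefix_len]].add(publisher)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}


# Hypothetical records; the ISBNs are placeholders and are not validated.
editions = [
    ("978-0-00-000001-0", "Example Press"),
    ("978-0-00-000002-7", "Example press"),
    ("978-1-11-111111-1", "Other House"),
]
candidates = group_by_isbn_prefix(editions)
for prefix, names in candidates.items():
    print(prefix, sorted(names))
```

Spellings that share a prefix are only candidates for consolidation. They would still need a name-similarity check or a manual review before merging.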
@rahulsabbineni would you be open to helping us create a Python bot to fix some of these?
@mekarpeles Sounds good! I've just completed the Google form for access to Slack and I'll take a look at getting the developer environment set up.
I've decided to scrape book data from publishers to create a smaller, curated data set instead of using OL. As such, I will not be able to work on this issue. Please close the issue if it's not important / redundant. Thanks!
Hi! I've been playing around with the OpenLibrary editions dump (ol_dump_editions_2022-03-29.txt) and noticed that there were a large number of duplicate publishers.
First, I parsed a large subset of the data - 17,684,756 editions and 18,086,971 publishers in those editions - and ingested the data into Postgres. Next, I ran a SQL query to find the top 1000 publishers by edition count. Finally, I used a basic string similarity algorithm (difflib.SequenceMatcher.ratio) on each pair of the top 1000 publishers to find potential duplicates. I found 37 pairs of publishers that were above a 0.90 similarity threshold and have included the data below. The number on the right is the number of editions belonging to the publisher in my dataset.
Some of these cases are legitimately distinct publishers (ex: Oxford University Press vs Oxford University Press, USA). However, as you can see from the data, many of the publishers are the same, but have different capitalization, spacing or abbreviations.
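For anyone who wants to try the comparison step, here is a minimal sketch of the pairwise check. It is not the exact script used for the numbers above. The function name `similar_pairs` and the sample names are assumptions for illustration; only `difflib.SequenceMatcher.ratio` and the 0.90 threshold come from the description.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(publishers, threshold=0.90):
    """Return (name_a, name_b, ratio) for each pair scoring above threshold."""
    pairs = []
    for a, b in combinations(publishers, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio > threshold:
            pairs.append((a, b, ratio))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical sample names, not rows from the dump.
sample = [
    "Oxford University Press",
    "Oxford University Press, USA",
    "Penguin Books",
    "Penguin  Books",
    "Springer",
]
for a, b, ratio in similar_pairs(sample):
    print(f"{ratio:.3f}  {a!r} ~ {b!r}")
```

With 1000 names this makes 499,500 comparisons, which is manageable. The cost grows quadratically, so running it over the full publisher list would need some kind of blocking first.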
Is there a mechanism on Open Library's backend to merge some of these publishers together in the more egregious cases? Could the problem be solved by providing a canonical Open Library ID (similar to works, editions, authors etc) for publishers? I'd be happy to help contribute if there's a lack of capacity to address this particular problem.
Hope the data is at least somewhat helpful. Thanks!