internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Large number of duplicate publishers #6570

Open rahulsabbineni opened 2 years ago

rahulsabbineni commented 2 years ago

Hi! I've been playing around with the OpenLibrary editions dump (ol_dump_editions_2022-03-29.txt) and noticed that there were a large number of duplicate publishers.

First, I parsed a large subset of the data (17,684,756 editions and 18,086,971 publishers in those editions) and ingested it into Postgres. Next, I ran a SQL query to find the top 1000 publishers by edition count. Finally, I ran a basic string similarity measure, difflib.SequenceMatcher.ratio, on each pair of the top 1000 publishers to find potential duplicates. I found 37 pairs of publishers above a 0.90 similarity threshold and have included the data below. The number on the right is the number of editions belonging to the publisher in my dataset.

Independently Published                                      |  707496
Independently published                                      |  100073

CreateSpace Independent Publishing Platform                  |  152920
Createspace Independent Publishing Platform                  |   67325

Oxford University Press                                      |   94952
Oxford University Press, USA                                 |   22493

HarperCollins Publishers                                     |   71782
HarperCollins Publishers Ltd                                 |    5142

Prentice Hall                                                |   45137
Prentice-Hall                                                |   18787

AuthorHouse                                                  |   34869
Authorhouse                                                  |   12002

Elsevier Science & Technology Books                          |   31759
Elsevier Science & Technology                                |    8410

Bloomsbury Publishing Plc                                    |   28060
Bloomsbury Publishing PLC                                    |    2945

Xlibris Corporation                                          |   23886
Xlibris Corporation LLC                                      |    9708

HarperCollins Publishers Limited                             |   23817
HarperCollins Publishers Ltd                                 |    5142

HarperCollins                                                |   21504
Harpercollins                                                |    6907

HardPress                                                    |   19645
Hard Press                                                   |    4888

Houghton Mifflin                                             |   14349
Houghton Mifflin Co                                          |    2861

University of Chicago Press                                  |   12347
University Of Chicago Press                                  |    3608

s.n.]                                                        |   11950
[s.n.]                                                       |    7878

Adamant Media Corporation                                    |   11859
Adams Media Corporation                                      |    4712

Kendall Hunt Publishing Company                              |   11189
Kendall/Hunt Publishing Company                              |    9586

Polity Press                                                 |    9058
Policy Press                                                 |    3650

University of North Carolina Press                           |    8967
University of South Carolina Press                           |    2921

Springer Berlin / Heidelberg                                 |    8077
Springer Berlin Heidelberg                                   |    7185

Johns Hopkins University Press                               |    7543
The Johns Hopkins University Press                           |    2660

VS Verlag fur Sozialwissenschaften GmbH                      |    5821
VS Verlag für Sozialwissenschaften                           |    4197

Gale NCCO, Print Editions                                    |    5532
Gale ECCO, Print Editions                                    |    3431

Scott Foresman                                               |    5511
Scott, Foresman                                              |    1912

Audible Studios on Brilliance Audio                          |    5271
Audible Studios on Brilliance                                |    4377

Golden Books                                                 |    4880
Golden books                                                 |    2264

Springer VS                                                  |    4415
Springer US                                                  |    3506

Farrar, Straus & Giroux                                      |    4142
Farrar, Straus and Giroux                                    |    4068

Macmillan Education                                          |    4133
Macmillan Education Ltd                                      |    2701

Springer Vieweg. in Springer Fachmedien Wiesbaden GmbH       |    4120
Springer Gabler. in Springer Fachmedien Wiesbaden GmbH       |    2488

Rowman & Littlefield Publishers, Inc.                        |    3684
Rowman & Littlefield Publishers                              |    2910

Children's Press                                             |    3376
Childrens Press                                              |    2880

Weidenfeld & Nicolson                                        |    2735
Weidenfeld and Nicolson                                      |    2677

W.W. Norton                                                  |    2717
W. W. Norton                                                 |    1946

Quarto Publishing Group UK                                   |    2660
Quarto Publishing Group USA                                  |    2138

Simon & Schuster Books for Young Readers                     |    2413
Simon & Schuster Books For Young Readers                     |    1977

Steck-Vaughn                                                 |    2080
Steck Vaughn                                                 |    2059
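
For readers who want to reproduce the comparison described above, here is a minimal sketch of the pairwise step. It assumes the publisher counts are already loaded into a dict; the function name `similar_pairs` and the `sample_counts` subset are illustrative and are not taken from the original script. The names and counts in the sample are copied from the list above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(counts, threshold=0.90):
    """Return (name_a, name_b, ratio) for each pair scoring above threshold.

    `counts` maps publisher name -> edition count. Comparing every pair is
    O(n^2), which is manageable for 1000 names (about 500,000 comparisons).
    """
    pairs = []
    for a, b in combinations(counts, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio > threshold:
            pairs.append((a, b, ratio))
    return pairs


# Illustrative subset of the data above (hypothetical variable name).
sample_counts = {
    "Prentice Hall": 45137,
    "Prentice-Hall": 18787,
    "Polity Press": 9058,
    "Policy Press": 3650,
    "Golden Books": 4880,
}

for a, b, ratio in similar_pairs(sample_counts):
    print(f"{a!r} ~ {b!r}: {ratio:.3f}")
```

A high ratio only flags a candidate. Each pair still needs a human check before anything is merged, because a one-character difference can separate two unrelated names just as easily as two spellings of the same one.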

Some of these cases are legitimately distinct publishers (e.g., Oxford University Press vs Oxford University Press, USA). However, as you can see from the data, many of the publishers are the same but differ in capitalization, spacing, or abbreviations.
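
One way to catch the purely cosmetic differences (capitalization, punctuation, "&" vs "and") without fuzzy matching is to group names by a normalized key. This is a rough sketch under assumed normalization rules; `normalize` and `group_variants` are hypothetical helpers, not anything that exists in Open Library, and the rules deliberately leave suffix differences such as "Ltd" or ", USA" alone.

```python
import re
from collections import defaultdict


def normalize(name):
    """Build a comparison key: case-folded, punctuation-free, single-spaced."""
    key = name.casefold()
    key = key.replace("&", " and ")          # treat "&" and "and" alike
    key = re.sub(r"[-/]", " ", key)          # hyphens and slashes act as spaces
    key = re.sub(r"[^\w\s]", "", key)        # drop remaining punctuation
    return re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace


def group_variants(counts):
    """Return {key: {original_name: count}} for keys with 2+ spellings."""
    groups = defaultdict(dict)
    for name, n in counts.items():
        groups[normalize(name)][name] = n
    return {k: v for k, v in groups.items() if len(v) > 1}


# Names and counts copied from the list in this issue.
counts = {
    "Independently Published": 707496,
    "Independently published": 100073,
    "Farrar, Straus & Giroux": 4142,
    "Farrar, Straus and Giroux": 4068,
    "Oxford University Press": 94952,
    "Oxford University Press, USA": 22493,
}

for key, variants in group_variants(counts).items():
    print(key, variants)
```

With these rules the two Oxford entries stay separate, while the case-only and "&"/"and" variants fall into the same group. Names that differ by internal spacing (for example HardPress vs Hard Press) are not caught and would still need a similarity check.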

Is there a mechanism on Open Library's backend to merge some of these publishers together in the more egregious cases? Could the problem be solved by providing a canonical Open Library ID (similar to works, editions, authors etc) for publishers? I'd be happy to help contribute if there's a lack of capacity to address this particular problem.

Hope the data is at least somewhat helpful. Thanks!

LeadSongDog commented 2 years ago

That’s another way of approaching #372. Suggest you look at sorting on ISBN: publishers buy ISBNs from registrars en bloc, so within a block, similar publisher name spellings could safely be consolidated.
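
A rough sketch of that idea follows, under a loud assumption: it groups on a fixed-length ISBN-13 prefix, whereas real registrant prefixes vary in length and need the official ISBN range data to be split correctly. The function name, the `PREFIX_LENGTH` value, and the sample records (including the ISBNs and publisher names) are all hypothetical.

```python
from collections import defaultdict

# Assumption for illustration only: real registrant prefixes vary in length.
PREFIX_LENGTH = 7


def names_by_isbn_prefix(editions, prefix_length=PREFIX_LENGTH):
    """Map each ISBN-13 prefix to the publisher spellings seen under it.

    `editions` is an iterable of (isbn, publisher_name) pairs. Only prefixes
    with more than one spelling are returned, as candidates for review.
    """
    names = defaultdict(set)
    for isbn, publisher in editions:
        digits = isbn.replace("-", "")
        if len(digits) == 13 and digits.isdigit():  # skip malformed ISBNs
            names[digits[:prefix_length]].add(publisher)
    return {p: n for p, n in names.items() if len(n) > 1}


# Made-up records; these are not real ISBNs or publishers.
editions = [
    ("978-1-2345-6789-7", "Example Press"),
    ("978-1-2345-0000-0", "Example press"),
    ("978-9-8765-4321-0", "Another House"),
]

print(names_by_isbn_prefix(editions))
```

Restricting name comparison to editions that share a prefix narrows the candidates to spellings that plausibly belong to one ISBN holder, which is what would make consolidation safer than matching on names alone.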

mekarpeles commented 2 years ago

@rahulsabbineni would you be open to helping us create a Python bot to fix some of these?

rahulsabbineni commented 2 years ago

@mekarpeles Sounds good! I've just completed the Google form for access to Slack and I'll take a look at getting the developer environment set up.

rahulsabbineni commented 2 years ago

I've decided to scrape book data from publishers to create a smaller, curated data set instead of using OL. As such, I will not be able to work on this issue. Please close the issue if it's not important or is redundant. Thanks!