internetarchive / openlibrary-bots

A repository of cleanup bots implementing the openlibrary-client

Wishlist bot creating duplicate works & badly formatted authors #24

Open tfmorris opened 6 years ago

tfmorris commented 6 years ago

My concern about large-scale data imports has always been that we be careful not to make our data quality issues worse for the sake of playing the "numbers game" to bulk up.

Perhaps I just got unlucky, but the very first wishlist bot addition that I looked at (linked from the OpenLibrary blog post) had three duplicated works and two duplicated authors, both with badly formatted names.

https://openlibrary.org/works/OL17890901W/Eagle's_Trees_and_shrubs_of_New_Zealand.
https://openlibrary.org/works/OL17900501W/Eagle's_Trees_and_shrubs_of_New_Zealand.
https://openlibrary.org/works/OL17900497W/Eagle's_Trees_and_shrubs_of_New_Zealand.

Eagle, Audrey Lily - https://openlibrary.org/authors/OL7416671A/Eagle_Audrey_Lily
Audrey LilyEagle - https://openlibrary.org/authors/OL7417982A/Audrey_LilyEagle

Although adding "1000 books" sounds like a relatively small sample, if this single book is representative, we now have many thousands of records to clean up.
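To illustrate how the two author records above could be flagged as likely duplicates before import, here is a minimal sketch of a comparison key. It is not the bot's actual logic; `author_key` is a hypothetical helper. It assumes a single comma means "Last, First" order, and it ignores case, spacing and punctuation, so it will over-match in some cases and should only be used to surface candidates for review.

```python
import re


def author_key(name: str) -> str:
    """Return a crude comparison key for an author name (hypothetical helper)."""
    name = name.strip()
    # Assumption: exactly one comma means the name is in "Last, First" order.
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        name = f"{first} {last}"
    # Drop whitespace, punctuation and underscores, then ignore case.
    return re.sub(r"[\W_]+", "", name.casefold())


# The two names from the records linked above collapse to the same key.
print(author_key("Eagle, Audrey Lily") == author_key("Audrey LilyEagle"))
```

Because the key removes all spacing, it also catches the missing space in "LilyEagle", which a plain string comparison would miss.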

hornc commented 6 years ago

The quality of author names in the source file (https://archive.org/download/openlibrary-bots/wish_list_march_2018.ndjson) is not the best; there are lots of other notes etc. in the data.

e.g. https://openlibrary.org/authors/OL7418142A/S._S._Associated_name_Author_Other_Shatalin

and four copies of:
https://openlibrary.org/authors/OL7418117A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418116A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418115A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418114A/R._C._1907-1990_Author_Editor_Other_Johnston
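As a rough illustration of the kind of cleanup these names would need, the sketch below strips role labels and life-date ranges from a name string. The input strings are inferred from the URL slugs above (underscores read as spaces), not taken from the source file, and the `ROLE_LABELS` list and `strip_notes` helper are hypothetical. Stripping is lossy: the dates are discarded here, although a real importer would probably want to keep them in a separate field.

```python
import re

# Assumption: role labels seen in the two example slugs above; a real list
# would have to be built from the source data.
ROLE_LABELS = ("Associated name", "Author", "Editor", "Other")

_ROLE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, ROLE_LABELS)) + r")\b")
_DATES_RE = re.compile(r"\b\d{4}-\d{4}\b")  # e.g. "1907-1990"


def strip_notes(name: str) -> str:
    """Remove date ranges and role labels, then collapse leftover whitespace."""
    name = _DATES_RE.sub(" ", name)
    name = _ROLE_RE.sub(" ", name)
    return " ".join(name.split())


print(strip_notes("S. S. Associated name Author Other Shatalin"))
print(strip_notes("R. C. 1907-1990 Author Editor Other Johnston"))
```

The match is case-sensitive and uses word boundaries, so a lowercase "other" or a word like "Otherwise" inside a name is left alone.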