Open tfmorris opened 6 years ago
The quality of author names in the source file, https://archive.org/download/openlibrary-bots/wish_list_march_2018.ndjson, is not the best. There are lots of other notes etc. mixed into the name data.
e.g. https://openlibrary.org/authors/OL7418142A/S._S._Associated_name_Author_Other_Shatalin
and four copies of
https://openlibrary.org/authors/OL7418117A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418116A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418115A/R._C._1907-1990_Author_Editor_Other_Johnston
https://openlibrary.org/authors/OL7418114A/R._C._1907-1990_Author_Editor_Other_Johnston
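To make the kind of check I have in mind concrete, here is a minimal sketch of a pre-import filter that flags author names carrying role notes or embedded dates. The `ROLE_WORDS` list and the `name_problems` helper are hypothetical, made up for illustration from the two examples above. They are not part of the wishlist bot or the Open Library import code, and a real filter would need a fuller list of role terms.

```python
import re

# Hypothetical list of cataloguing role words seen in the examples above;
# not taken from the bot or from any Open Library code.
ROLE_WORDS = ("Associated name", "Author", "Editor", "Other")
ROLE_RE = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ROLE_WORDS) + r")\b")

# A four-digit year followed by a hyphen and an optional second year,
# e.g. "1907-1990" or "1907-".
DATE_RE = re.compile(r"\b\d{4}\s*-\s*(?:\d{4})?")


def name_problems(name):
    """Return a list of reasons why an author name looks polluted."""
    problems = []
    if ROLE_RE.search(name):
        problems.append("role note")
    if DATE_RE.search(name):
        problems.append("embedded dates")
    return problems


if __name__ == "__main__":
    for candidate in (
        "S. S. Associated name Author Other Shatalin",
        "R. C. 1907-1990 Author Editor Other Johnston",
    ):
        print(candidate, "->", name_problems(candidate))
```

Names that come back with a non-empty list could be held for review instead of being created as new author records.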
My concern about large-scale data imports has always been that we be careful not to make our data quality issues worse for the sake of playing the "numbers game" to bulk up.
Perhaps I just got unlucky, but the very first wishlist bot addition that I looked at (linked from the OpenLibrary blog post) had three duplicated works and two duplicated authors, both with badly formatted names.
https://openlibrary.org/works/OL17890901W/Eagle's_Trees_and_shrubs_of_New_Zealand.
https://openlibrary.org/works/OL17900501W/Eagle's_Trees_and_shrubs_of_New_Zealand.
https://openlibrary.org/works/OL17900497W/Eagle's_Trees_and_shrubs_of_New_Zealand.
Eagle, Audrey Lily - https://openlibrary.org/authors/OL7416671A/Eagle_Audrey_Lily
Audrey LilyEagle - https://openlibrary.org/authors/OL7417982A/Audrey_LilyEagle
Although adding "1000 books" sounds like a relatively small sample, if this single book is representative, we now have many thousands of records to clean up.