Open mekarpeles opened 6 years ago
I'm confused as to why this is a ticket on a client library/tool repo. It sounds like an OpenLibrary data question.
To answer the question, no, not in any systematic fashion, but it's known to be a giant problem. Even for someone as popular as Danielle Steel who draws a huge number of eyes, there were four duplicate records when I merged them just now.
I created a list of duplicate author names (not all representing duplicate author records) and the top of the list looks like this (the first column is the count of the duplicates):
2368 Colloquy on European Law. 1981 Messina, Italy)
153 Leif Mejlbro
152
127 delete
108 Dorothy Ledbetter Murray
107 Stephen Moffat, The Mouse Training Company
87 Various
80 Aldous Huxley
72 Canada. British Columbia. Parks Branch.
71 Paul Newton
68 Patrick Vernon Murray
67 Dorothy Leadbetter Murray
66 Dixie Owens Murray
65 United Nations
64 British Standards Institution
61 Robert Alan Hill
60 Larry Walther
52 Delete
51 MTD Training
50 delete duplicate
49 Prof. Dr AP Faure
48 Great Britain
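For anyone who wants to reproduce a list like this, here is a minimal sketch of the counting step. It assumes the author records have already been loaded as `(key, name)` pairs; the function name, the sample keys, and the record shape are all hypothetical and not taken from any Open Library tool.

```python
from collections import Counter


def duplicate_name_counts(authors, min_count=2):
    """Return (count, name) pairs for names shared by at least min_count records.

    `authors` is an iterable of (key, name) pairs (an assumed, simplified shape).
    Names are compared exactly, apart from leading/trailing whitespace.
    """
    counts = Counter(name.strip() for _key, name in authors)
    # most_common() yields names in descending order of count.
    return [(n, name) for name, n in counts.most_common() if n >= min_count]


# Hypothetical sample data, for illustration only.
sample = [
    ("OL1A", "Aldous Huxley"),
    ("OL2A", "Aldous Huxley"),
    ("OL3A", "Various"),
    ("OL4A", "Aldous Huxley "),
    ("OL5A", "Various"),
    ("OL6A", "Paul Newton"),
]

for n, name in duplicate_name_counts(sample):
    print(n, name)
```

As noted above, a shared name does not prove the records are the same person, so the output is only a list of candidates to review by hand.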
I did a quick pass through the duplicates down to 15x and merged probably 1,000 author records.
As far as author records with no works go, I think we should attempt to identify their source (e.g. importer bug) and then summarily delete the ones which have never been used.
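A rough sketch of how the no-works authors could be found, assuming we have the set of author keys and a list of work records in memory. The record shape here (each work as a dict with an `"authors"` list of author keys) is a simplification I made up for illustration; real work records are structured differently and would need adapting.

```python
def authors_without_works(author_keys, works):
    """Return the author keys that no work record refers to.

    `works` is an iterable of dicts with an optional "authors" list of
    author keys (an assumed, simplified shape).
    """
    referenced = {key for work in works for key in work.get("authors", [])}
    return sorted(key for key in author_keys if key not in referenced)


# Hypothetical sample data, for illustration only.
author_keys = ["OL1A", "OL2A", "OL3A", "OL4A"]
works = [
    {"title": "Work one", "authors": ["OL1A"]},
    {"title": "Work two", "authors": ["OL1A", "OL3A"]},
    {"title": "Work with no author field"},
]

print(authors_without_works(author_keys, works))
```

This only tells us which records are unused right now. Whether a record was ever used would need the edit history, which this sketch does not look at.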
Here's a more varied example (not identical names) that I ran across randomly this evening. 17 duplicates, 15 with works, totalling 150 works records.
I don't think I ever search for an author and not see at least one or two duplicates. What are other people's experiences?
It is certainly very common. The longstanding problem with the add book dialog, where newly created authors did not appear in the drop-down list, led to huge numbers of exact or near-duplicate author records with just one work each. These will need to be cleaned up, and the sooner the better. A large subset have:
The single-work author records I referred to were often created sequentially, so we might find sequential author OLIDs to be common if this can be checked.
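One way to check the sequential-OLID idea, as a sketch only: pull the number out of each author OLID and look for runs of consecutive values. It assumes author identifiers end in the form `OL<number>A`; the helper names are hypothetical.

```python
import re

# Assumes author identifiers end in the form OL<number>A.
OLID_RE = re.compile(r"OL(\d+)A$")


def olid_number(key):
    """Extract the numeric part of an author OLID such as 'OL123A'."""
    match = OLID_RE.search(key)
    if match is None:
        raise ValueError(f"not an author OLID: {key!r}")
    return int(match.group(1))


def sequential_runs(keys, min_len=2):
    """Return runs of at least min_len consecutive OLID numbers, as OLIDs."""
    numbers = sorted({olid_number(key) for key in keys})
    runs, current = [], []
    for n in numbers:
        if current and n == current[-1] + 1:
            current.append(n)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [n]
    if len(current) >= min_len:
        runs.append(current)
    return [[f"OL{n}A" for n in run] for run in runs]


# Hypothetical keys for author records that share a name.
keys = ["OL10A", "OL11A", "OL12A", "OL20A", "/authors/OL31A", "OL30A"]
print(sequential_runs(keys))
```

Running this per duplicate name (rather than over all authors at once) would show whether the single-work duplicates really do cluster into consecutive IDs.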
Have we gone through the process of searching all authors with works and seeing which are likely duplicates?
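If nobody has, a first pass could look something like the sketch below. It compares normalized names pairwise with Python's standard `difflib.SequenceMatcher` and reports pairs above a similarity threshold. The threshold of 0.9 and the normalization are arbitrary assumptions on my part, and a pairwise comparison like this is quadratic, so it would need some kind of grouping (for example by surname) before it could run over the full author set.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def likely_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio is >= threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Names taken from the duplicate list earlier in this thread.
names = [
    "Dorothy Ledbetter Murray",
    "Dorothy Leadbetter Murray",
    "Paul Newton",
    "Aldous Huxley",
]
print(likely_duplicates(names))
```

Name similarity alone will produce false positives, so the pairs are candidates for review, not merges to apply automatically.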