internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Authors getting incorrect alternate names after merge #498

Open tfmorris opened 7 years ago

tfmorris commented 7 years ago

In looking at the Charles Dickens record, we see the following bad name forms, many of which could be detected and eliminated automatically:

Non-author contributors conflated:

"Flo Gibson (Narrator)", "illustrated by Arthur Rackham Charles Dickens", "Introduction-John Carey", "Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis"

Bad capitalization, spelling, etc:

"Dickens", "DICKENS", "Dickens, Charles", "Charles", "CHARLES DICKENS", "CHARLES. DICKENS", "Charles dickens", "Dickens Charles.", "C DICKENS", "Charled Dickens", "Dickens Charles", "harles Dickens",

Dates or transliterations embedded:

"Dickens, Charles, 1812-1870.", "DICKENS, CHARLES, 1812-1870", "Charles Dickens Charles Dickens", "Charles 1812-1870 Dickens", "Dickens, Charles,$d1812-1870", "Dickens,Charles ディケンズ,チャールズ (1812-1870)",

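A rough sketch of how some of the forms listed above might be flagged automatically. The function name, the labels, and the regular expressions are all hypothetical and deliberately incomplete; this is not code from the Open Library codebase.

```python
import re

# Illustrative patterns only; a real cleanup would need a much longer list.
ROLE_PATTERN = re.compile(
    r"\b(narrator|editors?|illustrated by|introduction|translator)\b",
    re.IGNORECASE,
)
# A four-digit year followed by a hyphen and an optional second year.
DATE_PATTERN = re.compile(r"(?<!\d)\d{4}\s*-\s*(?:\d{4})?(?!\d)")
# MARC subfield delimiters that leaked into the name, e.g. "$d".
MARC_SUBFIELD_PATTERN = re.compile(r"\$[a-z0-9]")


def flag_name_problems(name):
    """Return a list of labels describing suspicious features of a name."""
    problems = []
    if ROLE_PATTERN.search(name):
        problems.append("role")
    if MARC_SUBFIELD_PATTERN.search(name):
        problems.append("marc_subfield")
    if DATE_PATTERN.search(name):
        problems.append("dates")
    letters = [c for c in name if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        problems.append("all_caps")
    if name != name.strip():
        problems.append("whitespace")
    return problems
```

Names that come back with an empty list are not necessarily good (misspellings such as "Charled Dickens" pass untouched); the sketch only catches the mechanical cases.
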
LeadSongDog commented 7 years ago

Looking at the history https://openlibrary.org/authors/OL24638A/Charles_Dickens?m=history shows that most of that happened on author merges. See the diffs.

tfmorris commented 7 years ago

I assumed as much, but I don't think that makes a difference when it comes to cleanup -- or am I missing something?

LeadSongDog commented 7 years ago

The entry "Mike; Spencer, John (editors) (Charles Dickens; Lord Halifax; Edgar Allan Poe; Bram Stoker; O. Henry; William Mudford; Frederick Marryat; Matthew Lewis; William Makepeace Thackeray; W. W. Jacobs; Saki) Jarvis" came from some W record. Rather than just removing it from the merged A record, that W should be corrected too. The old A will now be a redirect; I think it should be deleted after the Ws are corrected.

LeadSongDog commented 7 years ago

That seems to have happened at https://openlibrary.org/authors/OL2895898A?m=diff&b=2 and https://openlibrary.org/authors/OL24638A/Charles_Dickens?m=diff&b=16

That leaves the question of which work used to link to https://openlibrary.org/authors/OL2895898A?v=1 before https://openlibrary.org/recentchanges/2012/03/04/merge-authors/45842283 was done.

LeadSongDog commented 7 years ago

A simpler case: OL6034980A is "Joseph Barrell " with a trailing space, apparently created (with other similar cases) by ImportBot on 27 Oct 2008 while importing ia:evolutioneartha02huntgoog into OL20493274M. The other authors on that edition were similarly afflicted.
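For the whitespace case, a minimal sketch of the cleanup, assuming author records are available as plain dicts with `key` and `name` entries (that record shape and both function names are assumptions for illustration):

```python
def normalize_whitespace(name):
    """Strip leading/trailing whitespace and collapse internal runs."""
    return " ".join(name.split())


def find_whitespace_fixes(authors):
    """Yield (key, fixed_name) for each record whose name would change."""
    for author in authors:
        fixed = normalize_whitespace(author["name"])
        if fixed != author["name"]:
            yield author["key"], fixed
```

Running `find_whitespace_fixes` over a dump first, rather than editing in place, gives a list that can be reviewed before any records are saved.
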

mekarpeles commented 6 years ago

related: #117

xayhewalo commented 4 years ago

@tfmorris Would you recommend fixing this issue at the Infobase level or at the Solr level? Also, are you or @hornc willing to be assignee for this issue? Note, being the assignee doesn't necessarily mean you are responsible for doing the work, just responsible for gathering/providing information to address the issue. From the wiki:

The assigned owner is not necessarily the person who will fix the issue (it is not necessarily even established, at that point, if or when the issue will be fixed at all), but rather they are the person who will do as much or as little as needed to handle the issue (asking questions, soliciting input, establishing and updating the priority, checking if it is a duplicate, etc).

Once an issue is labeled State: Work In Progress, the owner is the individual doing the work, or leading/coordinating the group that is doing the work.

I've added labels per context: let me know your thoughts

tfmorris commented 4 years ago

We could attempt a short term patch to the Solr updater to mitigate some of the most egregious issues, but long term we need to fix:

Most of the problem is cosmetic, so I consider it lower priority.

pranjii commented 2 years ago

Can I get this?

LeadSongDog commented 1 year ago

Hmm, I had the impression that the trailing spaces had been cleaned up, but I just came upon and fixed OL6027001A, which had been untouched since 2008.

cdrini commented 1 year ago

Basic approach for the alternate names issue:

Use uniq with a custom key fn. The key fn should:

  1. Make the name case-insensitive
  2. Remove punctuation
  3. Optional: Remove any string that is a substring of the main string.

eg given:

| Original Name | Converted Name (uniq key) |
| --- | --- |
| Charles Dickens | charles dickens |
| CHARLES DICKENS. | charles dickens |
| CHARLES. DICKENS | charles dickens |

Then we dedupe and choose just one instance of Charles Dickens, because they all have the same uniq key.
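A minimal sketch of that approach in plain Python, standard library only. The names `uniq_key` and `dedupe_names` are hypothetical, and `string.punctuation` only covers ASCII punctuation, so this is an illustration of the idea rather than a finished implementation:

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def uniq_key(name):
    """Case-insensitive, punctuation-free, whitespace-normalized key."""
    no_punct = name.translate(_PUNCT_TABLE)
    return " ".join(no_punct.casefold().split())


def dedupe_names(names, main_name=None):
    """Keep the first name seen for each key.

    If main_name is given, also drop any name whose key is a substring
    of the main name's key (the optional third step).
    """
    main_key = uniq_key(main_name) if main_name else None
    seen = set()
    kept = []
    for name in names:
        key = uniq_key(name)
        if not key or key in seen:
            continue
        if main_key is not None and key in main_key:
            continue
        seen.add(key)
        kept.append(name)
    return kept
```

Keeping the first occurrence is an arbitrary choice; picking the best-capitalized variant for each key would need an extra ranking step.
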

tfmorris commented 1 year ago

@cdrini Sounds like you're basically talking about creating the equivalent of an ICU primary strength sorting key (perhaps with some pre-processing cleanup). I suggest you just use PyICU directly to avoid having to reimplement everything, particularly for non-English names. https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

LeadSongDog commented 1 month ago

Low-hanging fruit that should be simpler to automate. Here’s a trivial case where the correction is simply to extract the dates from the name and put them in the born and died fields: https://openlibrary.org/authors/OL12376689A/William_Grimshaw?b=2&a=1&_compare=Compare&m=diff
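A sketch of what that extraction might look like for the simplest pattern, a year range at the end of the name. The function name and the `birth_date` / `death_date` keys are assumptions for illustration, and dates in the middle of a name or behind MARC subfield codes are deliberately left alone:

```python
import re

# Trailing "1812-1870", optionally wrapped in parentheses and/or followed
# by a period, with any separating commas and spaces before it.
TRAILING_DATES = re.compile(r"[\s,(]*\b(\d{4})\s*-\s*(\d{4})?\)?\.?\s*$")


def split_name_and_dates(name):
    """Split trailing years off a name; return the name unchanged if none."""
    match = TRAILING_DATES.search(name)
    if match is None:
        return {"name": name}
    result = {
        "name": name[:match.start()].strip(),
        "birth_date": match.group(1),
    }
    # The second year is optional, to allow open ranges such as "1950-".
    if match.group(2):
        result["death_date"] = match.group(2)
    return result
```

For example, `split_name_and_dates("Dickens, Charles, 1812-1870.")` yields the name "Dickens, Charles" with 1812 and 1870 as the two dates, while "Charles 1812-1870 Dickens" comes back unchanged for manual review.
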