internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Create interface for splitting conflated author names #5739

Open ghost opened 3 years ago

ghost commented 3 years ago

Author renaming is really tedious when you have a name that needs to be split into several names. For example, a bad import might create an author with the name John Smith Jane Doe Roger Anderson which should have been three separate author records, one each for John Smith, Jane Doe, and Roger Anderson.

Current Process

Sometimes this author only has one associated work, which makes for an easy fix. First, I find the identifier of the existing author record for one of the three authors, for example John Smith. I edit the work record and add the other two authors, for example by adding Jane Doe and Roger Anderson underneath John Smith Jane Doe Roger Anderson. Then I go back to the original author record and rename it from John Smith Jane Doe Roger Anderson to John Smith. Finally, I merge this record with the correct, existing author record I found for John Smith.

I've been getting a fair number of errors that prevent me from deleting an author record, so I've been merging instead. If there were fewer errors, I'd be fine with adding all three authors separately and then deleting the original incorrect author record.

The problem with this approach comes when that original record is attached to many works. Then this process becomes super tedious and error-prone. I have to open every work record and repeat the process.

Proposal

I like how the "add author" interface on the Work Details tab on a work record lets me look up authors. Can we create an "author renaming" interface that lets me do the same process, but which would apply to all affected works?


Stakeholders

@libjenner @seabelis @RayBB @jimman2003

ghost commented 3 years ago

I'd be okay in the short term with some kind of API call combo that we can call from an external tool or the command line.
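As a rough illustration of what such an external tool might do, here is a minimal Python sketch of the planning step only. It assumes a simplified, hypothetical work shape (a dict with a `key` and a flat `authors` list of author keys), which is not the actual Open Library record format, and the function name `split_conflated_author` and all keys below are made up for the example. It performs no API calls, renames, or merges; it only computes which works would change and what their author lists would become.

```python
from copy import deepcopy


def split_conflated_author(works, conflated_key, replacement_keys):
    """Return updated copies of the works that reference conflated_key.

    Assumption: each work is a dict whose "authors" value is a flat list of
    author keys. Real records are likely more nested than this.
    """
    updated = []
    for work in works:
        authors = work.get("authors", [])
        if conflated_key not in authors:
            continue  # this work is unaffected, so it is not part of the plan

        new_authors = []
        for key in authors:
            # Swap the conflated author for the individual authors, in place.
            candidates = replacement_keys if key == conflated_key else [key]
            for candidate in candidates:
                # Skip authors already listed so nobody is credited twice.
                if candidate not in new_authors:
                    new_authors.append(candidate)

        new_work = deepcopy(work)  # leave the caller's data untouched
        new_work["authors"] = new_authors
        updated.append(new_work)
    return updated


# Hypothetical keys, for illustration only.
works = [
    {"key": "/works/W1", "authors": ["/authors/CONFLATED"]},
    {"key": "/works/W2", "authors": ["/authors/OTHER", "/authors/CONFLATED", "/authors/JOHN"]},
    {"key": "/works/W3", "authors": ["/authors/OTHER"]},
]
plan = split_conflated_author(
    works,
    "/authors/CONFLATED",
    ["/authors/JOHN", "/authors/JANE", "/authors/ROGER"],
)
for work in plan:
    print(work["key"], work["authors"])
```

Returning a plan rather than saving anything leaves room for an editor to review every affected work before changes are written, and the original author record would still need to be renamed and merged by hand as described above.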

LeadSongDog commented 3 years ago

I like the idea, but of course care is needed. That could easily have been John Anderson; Jane Smith; Roger Doe. Also, where should the author record redirect be targeted? Perhaps a list of soft redirects would be more advisable.
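To give a sense of how ambiguous a conflated name string can be, the following Python sketch enumerates every way to cut a name into groups of adjacent words. The helper name `contiguous_splits` is hypothetical, and the sketch deliberately covers only order-preserving groupings; reordered combinations such as the one above are not included, so the real space of possibilities is larger still.

```python
from itertools import combinations


def contiguous_splits(name):
    """List every way to cut a name into groups of adjacent words."""
    tokens = name.split()
    n = len(tokens)
    results = []
    for cut_count in range(n):
        # Choose which gaps between words become boundaries between authors.
        for cuts in combinations(range(1, n), cut_count):
            bounds = [0, *cuts, n]
            groups = [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]
            results.append(groups)
    return results


candidates = contiguous_splits("John Smith Jane Doe Roger Anderson")
print(len(candidates))  # 32 candidate groupings for six words
print(["John Smith", "Jane Doe", "Roger Anderson"] in candidates)  # True
```

Six words already yield 32 candidate groupings, and nothing in the string itself says which one is right.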

ghost commented 3 years ago

Absolutely agree w/ @LeadSongDog. Initially I was kicking around the idea of the tool even suggesting authors by name, but too much automation would lead to an editor likely skipping the due diligence of figuring out which authors are correct. I had to work through a surprising amount of Jane John Smith Roger Anderson Doe during a recent pass through software conference proceedings & technical manuals.

For this issue, I don't need any intelligence or suggestions, I just want to remove the duplicate clicks that happen when an author record needs to be split into two or more authors but it has a lot of works.

~Ideally I think we should fully delete the now-empty and imprecise original author record. Is there anything we would lose by doing that? Is it better to redirect just in case someone was pointing to that author record directly?~ Edited to reflect the following comment from @seabelis.

seabelis commented 3 years ago

@libjenner Never delete valid author records. The process you described for correcting the conflated record,

"Then I go back to the original author record and rename the author name from John Smith Jane Doe Roger Anderson to John Smith. Then I merge this record with the correct, existing author record I found for John Smith."

is correct.

tfmorris commented 2 years ago

@seabelis Of course valid author records should never be deleted, but a record which names three different people (or a record that has a single name, but includes works by many different authors sharing that name) isn't valid. These records should never have existed in the first place, so deleting them is absolutely the correct thing to do, in my opinion.

A good test is "Could this record have ever qualified for an identifier from VIAF, Wikidata, etc?" The answer will always be "No" for conflated author records.

seabelis commented 2 years ago

@tfmorris These conflated authors are typically imported with a single work. So the import process should have split the author into individual profiles, but for whatever reason did not. The primary author of work A is author A. The rest is extra information that should be removed from that profile. Ideally, if the other names are valid authors at the work level, they should be added to the work independently.

Realmbird commented 4 months ago

@RayBB Can I work on this issue?

RayBB commented 4 months ago

@Realmbird you're assigned. This one is definitely a bit more complicated so don't be afraid to ask questions.