internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

MARC 100$q fuller_name not being imported #7408

Open tfmorris opened 1 year ago

tfmorris commented 1 year ago

As mentioned in my comment on #7349, MARC 100$q subfields aren't being imported:

Actually, the problem is worse than described for this specific example because the MARC record includes the given name, but it got dropped during the import. The edition which caused this author record to be created is linked to a MARC record where we can see that the author's given name is included in a 100$q subfield.

Originally posted by @tfmorris in https://github.com/internetarchive/openlibrary/issues/7349#issuecomment-1368295193

tfmorris commented 1 year ago

@hornc @mekarpeles Because I created this from a comment, it didn't get any of the standard boilerplate tags which are in the new issue template. It's a bug that I've split out from #7349 because I have a fix in hand for it. It's another Day 0 bug from 2010 which has never worked and was never tested because the test cases were built from the output of the existing implementation.

tfmorris commented 1 year ago

Oops! This is a duplicate of #2103.

tfmorris commented 1 year ago

As @hornc pointed out on the PR, there are a number of problems with using the fuller_name field. In addition to not being visible/editable/searchable, it's also associated with the particular name form that was imported from the MARC record, and there's no way to maintain that correspondence.

I plan to drop the fuller_name field and instead construct a name to be added to alternate_names. I've looked at several tens of thousands of records with 100$q from various sources and of various vintages and have found a variety of cases. Here are some that can be handled easily (90% or more of the occurrences):

  1. Smith, J. S. (John Steven) => John Steven Smith - this is the simplest case
  2. Smith, John S. (John Steven) => John Steven Smith - similar to above, but with a repeated name instead of an abbreviation
  3. Smith, J.-M. (Jean-Michel) => Jean-Michel Smith

And some which we punt on (at least for now), appending the $q to the normalized name:

  1. Smith, John (John Steven) => John Smith (John Steven) - this is often a name to be added, but we can't be 100% sure
  2. Smith, Wm. (William) => Wm. Smith (William) - there are a large number of abbreviations which could be handled with more work
  3. =100 1\$aJohnson, Tim$q(Timothy Byron) => Tim Johnson (Timothy Byron)
  4. =100 1\$aHubbard, L. Ron$q(La Fayette Ron) => L. Ron Hubbard (La Fayette Ron) - multipart names

A secondary question is which name form should be the primary name and which should go in alternate_names. @seabelis @hornc @mekarpeles any opinions?

tfmorris commented 3 months ago

Support for $q (and $6 linkage) needs to be added not only for 100 fields, but also for 600, 700, and 800 fields.

https://www.loc.gov/marc/bibliographic/bdx00.html
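
As a rough illustration of covering all four tags, the sketch below pulls $a and $q out of mnemonic-style field lines like the `=100 1\$a...$q(...)` examples earlier in this thread. It is not Open Library's MARC parser: the function name `personal_names_with_q`, the regular expressions, and the line-based input format are all assumptions, and $6 linkage is not handled.

```python
import re

# Personal-name fields that can carry a $q (fuller form of name) subfield.
PERSONAL_NAME_TAGS = {"100", "600", "700", "800"}

# "=TAG", whitespace, two indicator characters, then the subfields.
_FIELD = re.compile(r"^=(\d{3})\s+(.{2})(\$.*)$")
# A subfield is "$" + one-character code + data up to the next "$".
_SUBFIELD = re.compile(r"\$([0-9a-z])([^$]*)")


def personal_names_with_q(lines):
    """Yield (tag, $a, $q) for personal-name fields that have a $q subfield."""
    for line in lines:
        m = _FIELD.match(line.strip())
        if not m or m.group(1) not in PERSONAL_NAME_TAGS:
            continue
        # Simplification: a repeated subfield code keeps only its last value.
        subfields = dict(_SUBFIELD.findall(m.group(3)))
        if "q" in subfields:
            yield m.group(1), subfields.get("a", "").strip(), subfields["q"].strip()


record = [
    r"=100 1\$aJohnson, Tim$q(Timothy Byron)",
    r"=245 10$aSome title",
    r"=700 1\$aSmith, J. S.$q(John Steven)",
    r"=700 1\$aDoe, Jane",
]
found = list(personal_names_with_q(record))
```

Here `found` contains one entry for the 100 field and one for the first 700 field. The 245 field and the 700 field without a $q are skipped.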