internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

New book import should not reuse non-specific author records #2274

Open tfmorris opened 5 years ago

tfmorris commented 5 years ago

Description

The importer conflated two different authors because it wasn't sufficiently cautious about reusing author records that consist only of a name, without any dates or other supporting evidence.

The bad data came from poor cataloging by the San Francisco Public Library, but we need to screen imports for quality. The author name was cataloged two different ways in the MARC record, providing a hint as to both the trouble and the correct author record.

Relevant url?

https://openlibrary.org/books/OL26953710M/Embers_of_war

Expectation

Importers should reject low-quality data rather than make the Open Library database worse than it already is.

Stakeholders

@hornc

hornc commented 4 years ago

The other side of this issue appears to be that the import process will not match an existing author with dates if the import data does not have dates.

Example of a new author created: https://openlibrary.org/authors/OL7631978A/Joe_Dever when it should ideally have matched the existing, dated https://openlibrary.org/authors/OL760963A/Joe_Dever

On this import: https://dev.openlibrary.org/books/OL27327055M/Lone_wolf_28_-_The_Hunger_of_Sejanoz There were no existing work matches in this case -- I believe finding an existing work match would have unified the authors.

Given the original issue report, I'm not sure if this example is necessarily incorrect -- it seems to be doing the "right thing" by not assuming the undated name is the same as the dated name.

What I believe currently happens is that any other undated 'Joe Dever' import will be assumed to be this new undated author record, OL7631978A. I'm not aware of any conflation with this particular author, but it illustrates how dated and undated author records are matched.

I'm not sure what to do about this issue, though. It sounds like the proposal is to either:

A) create a new non-specific author record each time, just in case they are different;
B) not create anything unless we have at least one date (or other identifier); or
C) be smarter about how we use other identifiers, or infer whether a book is likely to be by an existing author by checking existing works by date or subject... or some other means.

As it currently stands, non-specific author records are all potentially conflated, while dated author records are harder to match and will potentially result in works by that author being attached to a non-specific version if the source data (library MARC or other) has not recorded the dates in the same way. (Aside: I have seen examples of VIAF author dates where a single individual appears to have multiple commonly catalogued birth years, so it's not just missing data, but incorrect, or simply 'different' data.)
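The behavior described above can be sketched with a toy lookup function. This is not Open Library's actual import code -- the record shapes and the `find_author` helper are hypothetical -- but it shows how requiring an exact match on the (possibly missing) date field makes an undated import skip a dated record and land on any undated record with the same name:

```python
# Hypothetical author table: a dated record and an undated duplicate.
authors = [
    {"key": "OL760963A",  "name": "Joe Dever", "birth_date": "1956"},
    {"key": "OL7631978A", "name": "Joe Dever"},  # no dates at all
]

def find_author(name, birth_date=None):
    """Return the first record matching on name and (exact) birth date.

    Because a missing date only matches another missing date, an undated
    import can never match the dated OL760963A record -- it falls through
    to the undated OL7631978A record, whoever that actually is.
    """
    for record in authors:
        if record["name"] != name:
            continue
        if record.get("birth_date") != birth_date:
            continue  # dated record never matches an undated import
        return record
    return None  # no match: the importer would create a new record
```

Under this sketch, `find_author("Joe Dever")` returns the undated OL7631978A record, while `find_author("Joe Dever", "1956")` returns OL760963A -- so every future undated 'Joe Dever' piles onto the non-specific record.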

A - seems a bit unnecessary to me, and will create a lot of probably easily avoidable duplicates. I know it is easy to merge authors but harder to split them. Maybe we need a split-author tool interface for librarians instead?

B - I think is too restrictive: not even all library records will meet this standard, and ancient authors don't necessarily have consistent dates. I'm not sure this is practical if we want to expand the catalog at all.

C - is where I think we could do better, but it'll require some analysis and identifying specific situations where we can get more confidence, or other sources of data to combine. The problem as I see it is that both the existing record and the record being imported need full, high-quality author data, and it is hard to get even one side near complete. Sometimes an author attribution is just a text string, and no one really has any more data than that.

As an example of how complicated author matching can get, one I tried to untangle is https://viaf.org/viaf/24746801/ Paul Barnett (1949-). He is not the same author as a scattering of other Paul Barnetts, but he is the same author as John Grant (1949-), who writes children's books -- yet not the same as another very prolific John Grant (1930-2014), who also writes children's books, nor a number of other less prolific John Grants. As Paul Barnett, his books cover a wider range of subjects, so it is hard to tell which are written by the same person. The two children's-author John Grants each have books which seem more characteristic of the other, and I had to research their careers to confirm who contributed to exactly which children's fantasy series.

Paul Barnett (1949-) is also the same author as "Devereux, Eve", pseud. All of this is represented on VIAF, but I had to check the story, since the published uses of the name all imply a distinct individual, and it was only an anecdote written much later by the author on his personal website that confirms it. Basically, it isn't possible to get some titles right without doing active research, and the printed item itself doesn't have enough information to provide an unambiguous match. Pseudonyms make it harder too, since sometimes the available information increases over time. OL does not currently have a good way to represent an author writing under a name; that would be a separate feature, but it would help preserve data and represent the sometimes complex reality of author attribution.

I don't have any solutions right now, but it seems like non-specific authors are a reality (they exist in the printed book), and specific authors are a value-add that can be determined in many cases, if we are lucky, or if a user (or librarian elsewhere) does some research to make the connection. The OL system should support this reality and allow the currently available data to be represented, but have mechanisms to expand that data.

This was meant to be a single link to a somewhat related example, but has turned into a mini-essay :)

tfmorris commented 4 years ago

I think the right approach is to use a scoring algorithm with positive and negative weights for various elements matching / missing / not matching. The resulting score can then be compared to a threshold which determines whether it's a good enough match. This allows us to tune both the weights and the threshold to achieve the behavior that we want.
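A minimal sketch of that scoring idea, assuming a simple dict-based record shape -- the field names, weights, and threshold below are all illustrative placeholders, not Open Library's schema or tuned values:

```python
# (match_weight, mismatch_weight) per field; missing data contributes nothing.
WEIGHTS = {
    "name":       (40, -100),
    "birth_date": (30, -60),
    "death_date": (20, -60),
    "viaf_id":    (100, -200),  # a shared identifier is strong evidence
}

MATCH_THRESHOLD = 60  # tunable alongside the weights

def score_author_match(incoming, candidate):
    """Score how likely two author records describe the same person.

    Matching fields add their positive weight, conflicting fields add
    their negative weight, and absent fields are skipped entirely.
    """
    score = 0
    for field, (match_w, mismatch_w) in WEIGHTS.items():
        a, b = incoming.get(field), candidate.get(field)
        if a is None or b is None:
            continue  # missing on either side: neither helps nor hurts
        score += match_w if a == b else mismatch_w
    return score

def is_match(incoming, candidate):
    return score_author_match(incoming, candidate) >= MATCH_THRESHOLD
```

With these example weights, a name-only match scores 40 and falls below the threshold (so a bare name wouldn't be reused), while name plus a matching birth date scores 70 and passes; both the weights and the threshold could then be tuned against real import data.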

xayhewalo commented 4 years ago

@hornc Labeling as backlogged per its category in the Continuous Import Pipeline project