internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.11k stars 1.34k forks

Sort by "item age" (age of entry, when item was entered) #8709

Open Freso opened 8 months ago

Freso commented 8 months ago

Describe the problem that you'd like solved

Trying to figure out the earliest entered Work (or other entity) in the database from a list such as https://openlibrary.org/authors/OL2317566A/Frank_Bohn can be really difficult if there are several pages and you have to manually compare the OL…W ids of the URLs.

(In this specific case, I am trying to figure out what "Frank Bohn" was added for so I can move the correct Works out of him into a new/another Frank Bohn, but it seems OL2317566A was created in 2008 and all the Works associated with it in 2009, so not all that helpful here, but it’s something I’ve been wanting for other entities before too.)

Proposal & Constraints

Add a "sorted by" option for filtering by "item age" or something else. Due to the technical nature of it, this could be limited to librarians.

Additional context

Stakeholders

Freso commented 3 months ago

Now that I am a super librarian reviewing other people’s merges, I keep coming across Author merge candidates with hundreds of Works associated with them. At that scale it is pretty much impossible to figure out the oldest associated Work by hand to see which Work an Author was added for.

tfmorris commented 3 months ago

Works didn't exist in 2008. The first works got created in Q4 2009 from the books (editions) which existed at the time.

Even today, although authors get assigned to works, they come from editions. Additionally, using the first book that an author was assigned to as the "real" meaning of the author record doesn't always work because even library MARC records conflate authors, particularly those with no associated birth-death dates. On a MARC record, an author is traditionally just a string, so two different "John Smith"s are indistinguishable from each other.

Unless an author record can clearly (and easily) be associated with the correct author, the best thing to do is move all the works to the appropriate individual author records (creating new ones as needed) and delete the conflated record when it's empty.

Freso commented 3 months ago

> The first works got created in Q4 2009

When I wrote that comment yesterday, I was looking at entities created in the 2010s and 2020s. Also, I get that this isn’t the one and only way to determine which book an author was added for, but it would be one more tool in the toolbox and might also be useful for other things.

> Unless an author record can clearly (and easily) be associated with the correct author, the best thing to do is move all the works to the appropriate individual author records (creating new ones as needed) and delete the conflated record when it's empty.

That is contrary to the Librarian documentation: https://github.com/onnotasler/OpenLibraryDocumentation/blob/main/Librarians-Deletion.adoc https://github.com/onnotasler/OpenLibraryDocumentation/blob/main/Librarians-Merge-DuplicateAuthors.adoc + comments repeatedly made by @seabelis in Slack and elsewhere

As such, this would constitute a change in policy and seems like something that should be run by @seabelis before being communicated like this. (Probably in its own (separate) issue, since this is unrelated to being able to sort by age and thus off-topic for the functionality of this feature request, even if it may be relevant for its motivation.)

tfmorris commented 3 months ago

> > Unless an author record can clearly (and easily) be associated with the correct author, the best thing to do is move all the works to the appropriate individual author records (creating new ones as needed) and delete the conflated record when it's empty.
>
> That is contrary to the Librarian documentation: https://github.com/onnotasler/OpenLibraryDocumentation/blob/main/Librarians-Deletion.adoc https://github.com/onnotasler/OpenLibraryDocumentation/blob/main/Librarians-Merge-DuplicateAuthors.adoc + comments repeatedly made by @seabelis in Slack and elsewhere

I don't see anything in either of those documents which addresses the case that I'm talking about. When they talk about conflated author records, they're talking about the junk imported from BWB of the form "Billy Bob and Sally Sue, co-authors, illustrated by Hergé." The types of records that I'm talking about are these: https://openlibrary.org/search/authors?q=undifferentiated. There is no way that anyone is ever going to be able to figure out what person is the first/best/etc for Smith. There's even an entry in the Library of Congress Name Authority File (LCNAF) for Smith because it was used historically, but it's tagged as ambiguous and forbidden for use.

If it's easy to identify what person a conflated record should represent, sure, go ahead and fix it, but in a lot of cases that conflation occurred before the record was imported because the source library's authority file had a generic "Smith, John" entry in it which was used in MARC records for books authored by multiple different people.

seabelis commented 3 months ago

@Freso You can try to find the earliest associated edition for an author using a query such as https://openlibrary.org/query.json?type=/type/edition&authors=/authors/OL2657659A&limit=100 (please note that you may need to adjust the limit). You can then sort the returned IDs to find the oldest. Please do not delete an author record, and please do not add "undifferentiated" to the author name.
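For anyone who wants to script this, here is a minimal Python sketch of the query-then-sort approach. It is not an official client. It assumes the endpoint returns a JSON list of objects with a `key` field such as `/books/OL…M`, and that a lower numeric ID roughly means an earlier creation date. The function names `build_query_url`, `numeric_id`, `oldest_first`, and `fetch_editions` are made up for illustration.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://openlibrary.org/query.json"


def build_query_url(author_key: str, limit: int = 100) -> str:
    """Build the edition query URL for one author key, e.g. '/authors/OL2657659A'."""
    params = {"type": "/type/edition", "authors": author_key, "limit": limit}
    # safe='/' keeps the slashes readable, matching the URL form used in this thread.
    return f"{BASE_URL}?{urlencode(params, safe='/')}"


def numeric_id(key: str) -> int:
    """Extract the number from an Open Library key such as '/books/OL123M'."""
    match = re.search(r"OL(\d+)[AMW]$", key)
    if match is None:
        raise ValueError(f"not an Open Library key: {key!r}")
    return int(match.group(1))


def oldest_first(records: list[dict]) -> list[str]:
    """Return the keys sorted by numeric ID, lowest (assumed oldest) first.

    Assumption: each record looks like {"key": "/books/OL123M"}.
    """
    return sorted((record["key"] for record in records), key=numeric_id)


def fetch_editions(author_key: str, limit: int = 100) -> list[dict]:
    """Fetch the raw records from the live site (performs a network request)."""
    with urlopen(build_query_url(author_key, limit)) as response:
        return json.load(response)
```

Calling `oldest_first(fetch_editions("/authors/OL2657659A"))[0]` would then give the lowest-numbered edition among the records returned. The sort is numeric rather than alphabetical because a plain string sort would put `OL10M` before `OL9M`. If the author has more editions than `limit`, the result only covers the records that were actually returned.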

seabelis commented 3 months ago

It would be helpful if, when a work or author is created, the original source were noted and made unmodifiable, so that there is always a single source of truth. Presently these source records exist only for editions and for some authors and works, and they are sometimes lost on author and work records if a merge is reversed.

Freso commented 3 months ago

> It would be helpful if, when a work or author is created, the original source were noted and made unmodifiable, so that there is always a single source of truth.

Not quite what you’re asking for, but somewhat related: https://github.com/internetarchive/openlibrary/issues/9470 :)

mekarpeles commented 3 months ago

@seabelis relying on you to tell us what you think should be done here :)