internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Create endpoint for Archive.org to update OCAIDs in OL records #7539

Open jimchamp opened 1 year ago

jimchamp commented 1 year ago

Describe the problem that you'd like solved

Today, there is no way for Internet Archive to notify Open Library of important changes to records. If data becomes out-of-sync between the two platforms, Open Library records must be updated and synced by a human. Before this is done, somebody has to first notice that the data is out-of-sync. This is not ideal...

Proposal & Constraints

Edited proposal: Create a new POST handler and endpoint for syncing OCAID changes. Sync requests must contain S3 credentials, edition olid, the original ocaid, and either is_dark if the item has been made dark, or new_ocaid if the OCAID has been changed.
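
A minimal sketch of how the body of such a sync request might be validated, assuming it arrives as decoded JSON. The key names `s3_access` and `s3_secret`, the `OcaidSyncRequest` class, and the `parse_sync_request` helper are hypothetical. Only `olid`, `ocaid`, `is_dark`, and `new_ocaid` are named in the proposal. The sketch also reads "either ... or" as "exactly one of", which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcaidSyncRequest:
    # Hypothetical container; S3 key names are assumptions, not a confirmed API.
    s3_access: str
    s3_secret: str
    olid: str
    ocaid: str
    is_dark: bool = False
    new_ocaid: Optional[str] = None


def parse_sync_request(payload: dict) -> OcaidSyncRequest:
    """Validate a decoded JSON body; raise ValueError if it is malformed."""
    required = ("s3_access", "s3_secret", "olid", "ocaid")
    missing = [key for key in required if not payload.get(key)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))

    is_dark = bool(payload.get("is_dark", False))
    new_ocaid = payload.get("new_ocaid") or None

    # Assumption: a request reports a darked item OR a rename, never both or neither.
    if is_dark == (new_ocaid is not None):
        raise ValueError("provide exactly one of is_dark or new_ocaid")

    return OcaidSyncRequest(
        s3_access=payload["s3_access"],
        s3_secret=payload["s3_secret"],
        olid=payload["olid"],
        ocaid=payload["ocaid"],
        is_dark=is_dark,
        new_ocaid=new_ocaid,
    )
```

Checking the S3 credentials against archive.org is left out here, since the thread does not say how that would be done.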

Additional context

Stakeholders

@mekarpeles

hbromley commented 1 year ago

I had proposed something a bit different further into the Slack thread:

what if archive.org sent OL a general “re-sync this item” request, with OL determining which values had changed since the last sync? the only additional special information archive.org would have to send would be in the rename case, where it would include something like old_id={identifier}, so OL would know which prior values to compare. (the “made dark” case needs no special indication, as OL will find out that the item is dark when it tries to fetch its metadata.)

This situation is different from initial imports because archive.org will be sending the request only for items that have already been previously imported (but have now been changed in some meaningful way), and archive.org will know the key for the corresponding OL record.

That's why I'm suggesting a distinct "re-sync" request, with OL comparing the current archive.org info with the prior sync, and making updates as needed. Could we set up something where archive.org sends, for example, ocaid=goodytwoshoes00newyiala&olid=OL7095326M (or the equivalent in JSON) and OL compares what it previously had and makes updates as needed? It may find that the archive.org item's ISBN or other metadata has changed; if the item has been darked, it will find is_dark: true in the item's metadata record; if the item has been renamed, we could include an old_ocaid=... value in the request.

Does that sound feasible?
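
For illustration, the re-sync request described above could be encoded and decoded roughly as follows. This is only a sketch: the helper names `build_resync_query` and `parse_resync_query` are hypothetical, and the parameter names `ocaid`, `olid`, and `old_ocaid` are taken from the example in this comment rather than from any existing endpoint.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode


def build_resync_query(ocaid: str, olid: str, old_ocaid: Optional[str] = None) -> str:
    """archive.org side: encode a re-sync request as a query string."""
    params = {"ocaid": ocaid, "olid": olid}
    if old_ocaid:
        # Only sent in the rename case, so OL knows which prior value to look for.
        params["old_ocaid"] = old_ocaid
    return urlencode(params)


def parse_resync_query(query: str) -> dict:
    """Open Library side: decode the query string and check the required keys."""
    parsed = {key: values[0] for key, values in parse_qs(query).items()}
    for key in ("ocaid", "olid"):
        if key not in parsed:
            raise ValueError(f"missing required parameter: {key}")
    return parsed


query = build_resync_query("goodytwoshoes00newyiala", "OL7095326M")
print(query)  # ocaid=goodytwoshoes00newyiala&olid=OL7095326M
```

Note that no "made dark" flag appears in the request; under this proposal OL would learn that from the item's metadata when it fetches it.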

mekarpeles commented 1 year ago

How about something like this @hbromley? https://github.com/internetarchive/openlibrary/issues/7543#issuecomment-1431908921

hbromley commented 1 year ago

That is quite different from what I proposed in my previous comment. Is that approach not feasible?

jimchamp commented 1 year ago

Sorry for the delay (and confusion) @hbromley. The new endpoint that you are describing is totally feasible, and I'll start working on it today.

hbromley commented 1 year ago

I don't see how #7720 addresses the request I made in this issue. Please see my comments above, here and here.

seabelis commented 1 year ago

I have concerns about items that are linked to multiple records, either the OCAID is associated with multiple OL editions or an Open Library edition ID is associated with multiple IA items. Both of these cases are not uncommon.

Second, the Open Library record may represent an accurate manifestation with an incorrect OCAID. I would not want to update a correct record to match the IA item. Instead the item should be linked to the correct record. If an Open Library record represents a specific manifestation or work, it should always represent that manifestation or work regardless of what it is linked to on IA.

Can this be done automatically AND accurately?

mekarpeles commented 1 year ago

@hbromley can I check my understanding? One of the main differences is us fetching the metadata for the archive.org item as opposed to petabox sending it over. Also, the API should handle the dark and rename cases:

Presumably we want an API on openlibrary like openlibrary.org/api/sync/{ia_identifier} that...

openlibrary.org/api/sync/goody?old_ocaid=goodytwoshoes

I think for starters, especially given @seabelis's feedback, we may just want the sync API to (a) fill in missing metadata, (b) disassociate dark ocaids, and (c) handle renaming of ocaids
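
A rough sketch of what (a), (b), and (c) could look like as a pure planning step that computes changes without saving anything. Everything here is an assumption for illustration: `plan_sync` is a hypothetical helper, editions and archive.org items are modelled as plain dicts, and the `fillable` field names are placeholders rather than the real mapping between the two schemas.

```python
from typing import Optional


def plan_sync(
    edition: dict,
    ia_item: dict,
    old_ocaid: Optional[str] = None,
    fillable: tuple = ("title", "publisher"),  # placeholder field names
) -> dict:
    """Return the edition fields that would change; nothing is written here."""
    changes = {}
    current = edition.get("ocaid")

    # (b) Disassociate dark OCAIDs and stop: a dark item has nothing to copy from.
    if ia_item.get("is_dark"):
        if current and current in (ia_item.get("identifier"), old_ocaid):
            changes["ocaid"] = None
        return changes

    # (c) Handle renames: only touch the edition if it still points at the old id.
    if old_ocaid and current == old_ocaid:
        changes["ocaid"] = ia_item["identifier"]

    # (a) Fill in missing metadata only; existing values are never overwritten.
    for field in fillable:
        if not edition.get(field) and ia_item.get(field):
            changes[field] = ia_item[field]

    return changes
```

Because existing values are left alone, a correct Open Library record would not be rewritten to match a mismatched archive.org item, which is one way to respect the concern raised above.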

hbromley commented 1 year ago

Here's another attempt to describe concisely the problem we have, and the solution I had suggested; I'm certainly open to any other solution that might be preferable.

(See the original problem statement in Slack.)

Certain OL records contain links to archive.org items. For example, this record (OL27240549M) has a source_record value of ia:worldofwarcraftv0002metz; correspondingly, the archive.org item has openlibrary_edition=OL27240549M and openlibrary_work=OL20060545W values in its metadata.

I don't know anything about how those linkages are originally established, and I'm not suggesting any changes to whatever that process is, including the treatment of any one-to-many relationships that may exist. The question is simply what should be done if that archive.org item is renamed (as happened in this case), or darked (as also happened in this case), or has critical identifying metadata (like ISBN) altered, as any of those events render the existing linkages invalid.

My suggestion was that OL establish a general "re-sync this archive.org item" endpoint, with the approximate semantics of "redo whatever you did before to link these OL records to this archive.org item, because the old linkage is no longer valid," and with the archive.org code tweaked to contact that endpoint when appropriate—i.e., when archive.org alters an item that has OL linkages in its own metadata, by darking or renaming the item, or by replacing its bibliographic metadata. If the item has been darked, OL will discover that fact when it attempts to read it, and can dissociate the id; if the item has been renamed, archive.org would provide the old identifier, too, so that OL can find the relevant record(s) and update it/them; if the item's bibliographic identifiers have changed, OL can make whatever updates are necessary.
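
On the archive.org side, the "contact that endpoint when appropriate" decision might be sketched as a comparison of an item's metadata before and after a change. This is a hedged illustration only: `resync_reasons` is a hypothetical function, metadata is modelled as plain dicts, and the `identifier` key and `id_fields` default are assumptions. `openlibrary_edition`, `openlibrary_work`, and `is_dark` are the names mentioned in this thread.

```python
def resync_reasons(before: dict, after: dict, id_fields: tuple = ("isbn",)) -> list:
    """Return the reasons OL should be asked to re-sync; an empty list means none."""
    # Items with no Open Library linkage in their metadata never trigger a request.
    if not (before.get("openlibrary_edition") or before.get("openlibrary_work")):
        return []

    reasons = []
    if after.get("is_dark") and not before.get("is_dark"):
        reasons.append("darked")
    if before.get("identifier") != after.get("identifier"):
        # In this case the request would also carry the old identifier.
        reasons.append("renamed")
    if any(before.get(field) != after.get(field) for field in id_fields):
        reasons.append("identifiers_changed")
    return reasons
```

A change that touches none of these (for example, a description edit) returns an empty list, so no request would be sent.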

Again, if there's some better way to address the problem, I'm not especially stuck on this one.

cdrini commented 1 year ago

@mekarpeles and I were chatting about possibly reindexing a document when its IA item changes as well. If the OL record gets edited, then nothing will be needed and the item will be reindexed by solr-updater. But if it doesn't get edited (e.g. just an IA collection changes), we'll have to do something like this:

https://github.com/internetarchive/openlibrary/blob/f3c7db292127774a2aed5f87448b0239cc2c8cb6/openlibrary/plugins/admin/code.py#L878-L891

solr-updater reads from this .store value.

mekarpeles commented 1 year ago

Related to #8343

mekarpeles commented 11 months ago

Certain items may get darked in bulk temporarily and so we want to discuss options before moving forward with an automatic solution.