Sync IA ids latest data dump for August 2023

cdrini commented 1 year ago

Same as #7217

Q: Could we add a task to our Monthly Data Dumps cron to automate this?

[x] @cdrini Generate partial IA dump of works of interest
[x] @scottbarnes Run reconcile on new dump

scottbarnes commented 1 year ago

I started on this, but @hornc pointed out that updating the source_records field, as the script was doing, is a mistake, because:

having source_records set implies that a record was imported or re-imported from the archive.org MARC record or metadata; and
adding a source_record without the import might give the wrong impression about whether the data was taken from archive.org or merely linked after an import from elsewhere.

Here that data is merely being linked after being imported from elsewhere, as these items were added to Open Library before they were scanned by Internet Archive.

One suggestion is to simply re-import the item to trigger the usual import process, which will do all the right things. For many items, this works well. For example, consider the diff for OL47730945M, where a cover and the number_of_pages are filled in.

However, part of the reason some of these items are not linked is that the current importer rejects them.

Consider 007exoticlocatio0000arno/OL8582004M, which does appear to be a book:

❯ http POST https://openlibrary.org/api/import/ia identifier==007exoticlocatio0000arno require_marc==false bulk_marc==false Cookie:$OL_PROD_COOKIE
HTTP/1.1 400 Bad Request
[...]

{
    "error": "Item rejected",
    "error_code": "item-not-book",
    "success": false
}

One strategy would be to go through the list, attempt to re-import every item, track any that return a 400 status code, and then examine those and figure out how best to address them as a separate matter.
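A rough sketch of that strategy, in case it helps the discussion. It assumes the same endpoint and query parameters as the request above; the function names (`post_import`, `error_code_of`, `reimport_all`) are made up for illustration and are not part of the existing reconcile script. The sender is passed in as a callable so the bookkeeping can be exercised without hitting production.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable

# Endpoint taken from the request shown above.
IMPORT_URL = "https://openlibrary.org/api/import/ia"


def post_import(ocaid: str, cookie: str) -> tuple[int, str]:
    """POST one identifier to the import endpoint; return (status, body)."""
    query = urllib.parse.urlencode(
        {"identifier": ocaid, "require_marc": "false", "bulk_marc": "false"}
    )
    req = urllib.request.Request(
        f"{IMPORT_URL}?{query}", data=b"", method="POST", headers={"Cookie": cookie}
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the body still carries the error JSON.
        return err.code, err.read().decode("utf-8", "replace")


def error_code_of(body: str) -> str:
    """Pull "error_code" out of a JSON error body, tolerating odd responses."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unparseable"
    if isinstance(payload, dict):
        return str(payload.get("error_code", "unknown"))
    return "unknown"


def reimport_all(
    ocaids: Iterable[str], send: Callable[[str], tuple[int, str]]
) -> tuple[list[str], dict[str, str], dict[str, int]]:
    """Try every identifier; bucket results as imported, rejected (400), or other."""
    imported: list[str] = []
    rejected: dict[str, str] = {}  # ocaid -> error_code, to examine separately
    other: dict[str, int] = {}     # ocaid -> unexpected status, e.g. 5xx
    for ocaid in ocaids:
        status, body = send(ocaid)
        if 200 <= status < 300:
            imported.append(ocaid)
        elif status == 400:
            rejected[ocaid] = error_code_of(body)
        else:
            other[ocaid] = status
    return imported, rejected, other
```

A real run would be something like `reimport_all(ocaids, lambda o: post_import(o, cookie))`, with `cookie` holding the value of `$OL_PROD_COOKIE`. Grouping `rejected` by error code afterwards should show how many items fail as `item-not-book` versus something else.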

Thoughts?

@cdrini

cdrini commented 1 year ago

Scott notes that the sync-up is complete; we're now rerunning reconcile using Charles' suggestion to hit the import endpoint. This is working and adding the extra data Scott noted above. It's slower, so it will take ~7 days, but since the OCAID sync is complete we can close this issue.

130,775 ocaids were synced! 🥳

You can monitor the new reconcile run here: https://openlibrary.org/people/scott365bot

internetarchive / openlibrary

Sync IA ids latest data dump for August 2023 #8289