internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Sync IA ids latest data dump for August 2023 #8289

Closed: cdrini closed this issue 1 year ago

cdrini commented 1 year ago

Same as #7217

Q: Could we add a task to our Monthly Data Dumps cron to automate this?

scottbarnes commented 1 year ago

I started on this, but @hornc pointed out that updating the source_records field, as the script was doing, is a mistake, because:

Here that data is merely being linked after being imported from elsewhere, as these items were added to Open Library before they were scanned by Internet Archive.

One suggestion is to simply re-import the item to trigger the usual import process, which will do all the right things. For many items, this works well. For example, consider the diff for OL47730945M, where a cover and the number_of_pages are filled in.

However, part of the reason some of these items are not linked is that the current importer rejects them.

Consider 007exoticlocatio0000arno/OL8582004M, which does appear to be a book:

❯ http POST https://openlibrary.org/api/import/ia identifier==007exoticlocatio0000arno require_marc==false bulk_marc==false Cookie:$OL_PROD_COOKIE
HTTP/1.1 400 Bad Request
[...]
{
    "error": "Item rejected",
    "error_code": "item-not-book",
    "success": false
}

One strategy would be to go through the list, attempt to re-import every item, record anything that replies with a 400 status code, and then examine those rejections and figure out how best to address them as a separate matter.

Thoughts?

@cdrini

cdrini commented 1 year ago

Scott notes that the sync-up is complete; we're now rerunning reconcile with Charles' suggestion to hit the import endpoint. This is working and adding the extra data Scott noted above. It's slower, so it will take ~7 days. But since the OCAID sync is complete, we can close this issue.

130,775 ocaids were synced! 🥳

You can monitor the new reconcile run here: https://openlibrary.org/people/scott365bot