internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

~1000 records have a bad source: ia:ic value #8215

Closed hornc closed 1 year ago

hornc commented 1 year ago

Some items have a bad source: value that breaks import lookups when the items are scanned by archive.org. We should have a MARC record for them, but the lookup fails because the MARC source is not returned in the expected format. Instead it comes back in a shortened form that omits all the specific location and offset data.

ia:ic -- these look like single item archive.org sources, but they are supposed to be bulk MARC sources

example:

https://openlibrary.org/books/OL22347301M/Goethe%27s_Faust

The history UI correctly shows a link to the MARC source:

https://openlibrary.org/show-records/marc_ithaca_college/ic_marc.mrc:64550055:881

But this location is NOT visible in the metadata.

Apparently it is coming from the machine_comment, https://github.com/search?q=repo%3Ainternetarchive%2Fopenlibrary%20machine_comment&type=code

I'd like to see these sources replaced with the correct values from the machine comment, e.g. replacing "source": "ia:ic" with "source": "marc:marc_ithaca_college/ic_marc.mrc:64550055:881"

List of OLids with this problem: ic_source.txt

Stakeholders

hornc commented 1 year ago

@cdrini Do you know a method for getting the machine_comment without direct db access?

hornc commented 1 year ago

Machine comment is on the history rather than the record, and the earliest one can be retrieved on the command line with:

curl "https://openlibrary.org/books/OL22347301M.json?m=history" | jq ".[-1].machine_comment"

=> "marc_ithaca_college/ic_marc.mrc:64550055:881"
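The same lookup can be sketched in Python. This is a minimal sketch that assumes, as the jq expression above does, that the history endpoint returns a JSON list whose last element is the earliest revision. The function name `earliest_machine_comment` and the sample data are hypothetical; fetching the JSON (for example with `requests` or the `olclient` session) is left out so the logic can be run offline.

```python
from typing import Optional


def earliest_machine_comment(history: list) -> Optional[str]:
    """Return the machine_comment of the earliest revision, or None.

    Assumes `history` is the parsed JSON list from `?m=history`,
    with the earliest revision last (mirroring jq's `.[-1]`).
    """
    if not history:
        return None
    return history[-1].get("machine_comment")


# Hand-written history list; the field values are illustrative only.
sample_history = [
    {"comment": "later edit", "machine_comment": None},
    {"comment": "initial import",
     "machine_comment": "marc_ithaca_college/ic_marc.mrc:64550055:881"},
]
print(earliest_machine_comment(sample_history))
```

Using `.get()` rather than indexing means a revision without a machine_comment key yields None instead of raising.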

hornc commented 1 year ago

Getting a list of suspect short ia: sources (6 characters or fewer after the prefix -- are there any more of these?):

zgrep "\"ia:[^\"]\{,6\}\"" ol_dump_editions_2023-08-31.txt.gz > short_ia_source.tsv

Found one record with just ia: -- fixed manually: https://openlibrary.org/books/OL13489258M/Children%27s_Perception_of_Sarkar
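For anyone without zgrep to hand, the same scan could be sketched in Python. This is an illustration under assumptions, not the command that was run: the pattern is a direct translation of the zgrep expression above, the helper names `short_ia_lines` and `scan_dump` are hypothetical, and the sample lines (including the long `ia:` identifier) are made up for the demonstration.

```python
import gzip
import re
from typing import Iterable, Iterator

# Same pattern as the zgrep call: a quoted "ia:" value with at most
# six characters after the prefix.
SHORT_IA = re.compile(r'"ia:[^"]{0,6}"')


def short_ia_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains a suspiciously short ia: value."""
    for line in lines:
        if SHORT_IA.search(line):
            yield line


def scan_dump(path: str) -> Iterator[str]:
    """Stream a gzipped dump line by line without unpacking it first."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        yield from short_ia_lines(dump)


# Made-up sample lines standing in for dump rows.
sample = [
    '{"key": "/books/OL1M", "source_records": ["ia:ic"]}',
    '{"key": "/books/OL2M", "source_records": ["ia:goethesfaust00goet"]}',
    '{"key": "/books/OL3M", "source_records": ["ia:"]}',
]
for hit in short_ia_lines(sample):
    print(hit)
```

Like the zgrep version, this matches any quoted string starting with ia: and does not check which field it appears in, so hits still need a manual look.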

hornc commented 1 year ago

All machine_comment sources have been written back into source_records, replacing the ia:ic values for the list above.

reported example: https://openlibrary.org/books/OL22347301M/Goethe's_Faust?_compare=Compare&b=5&a=4&m=diff

Code used:

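#!/usr/bin/env python3
"""Replace 'ia:ic' source_records with the machine_comment from each edition's history."""
import sys

from olclient.openlibrary import OpenLibrary

ol = OpenLibrary()


def mc(olid):
    """Return the machine_comment of the earliest history entry for an edition."""
    history = ol.session.get(f'https://openlibrary.org/books/{olid}.json?m=history').json()
    return history[-1]['machine_comment']


# The first argument is a file with one OLID per line.
with open(sys.argv[1]) as f:
    icia_olids = f.read().splitlines()

for olid in icia_olids:
    a = ol.get(olid)
    try:
        # index() raises ValueError when 'ia:ic' is not present.
        i = a.source_records.index('ia:ic')
        a.source_records[i] = mc(olid)
        print(a.save('correct ia:ic source_record'))
    except ValueError:
        print(f"ia:ic not found in {olid} source_record, skipping.")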

Reviewing this now, I probably should have had the code confirm that each machine_comment was in the expected marc_ithaca_college/ic_marc.mrc format before writing it, but that seems to be the case for all of them anyway. This exact code should never need to be used again, but I'm sharing it as a reference in case anything similar needs to be written. Getting the machine_comment from history isn't really explained anywhere.
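For reference, the missing format check could look something like the sketch below. It is not the code that was run. The helper names `marc_source` and `fix_source_records` are hypothetical, the pattern is inferred from the single example value in this thread (file name followed by two numeric fields), and the `marc:` prefix follows the target value given in the issue description rather than what the script above wrote.

```python
import re

# Expected shape, inferred from the example
# "marc_ithaca_college/ic_marc.mrc:64550055:881":
# the bulk MARC file name followed by two numeric fields.
MARC_COMMENT = re.compile(r"marc_ithaca_college/ic_marc\.mrc:\d+:\d+")


def marc_source(comment):
    """Validate a machine_comment and return it as a marc: source value."""
    if not isinstance(comment, str) or not MARC_COMMENT.fullmatch(comment):
        raise ValueError(f"unexpected machine_comment: {comment!r}")
    # The "marc:" prefix is an assumption taken from the issue description.
    return f"marc:{comment}"


def fix_source_records(source_records, comment):
    """Return a copy of source_records with 'ia:ic' replaced.

    The comment is validated first, so nothing is changed on bad input.
    """
    new_value = marc_source(comment)
    return [new_value if s == "ia:ic" else s for s in source_records]


records = ["ia:ic", "ia:someotheritem"]  # illustrative values
print(fix_source_records(records, "marc_ithaca_college/ic_marc.mrc:64550055:881"))
```

Raising on an unexpected comment, instead of skipping silently, makes it obvious which records need a manual look before anything is saved.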