Closed by hornc 1 year ago
@cdrini Do you know a method for getting the machine_comment without direct db access?
Machine comment is on the history rather than the record, and the earliest one can be retrieved on the command line with:
curl "https://openlibrary.org/books/OL22347301M.json?m=history" | jq ".[-1].machine_comment"
=> "marc_ithaca_college/ic_marc.mrc:64550055:881"
Getting a list of suspect short ia: sources (6 chars or fewer -- are there any more of these?):
zgrep "\"ia:[^\"]\{,6\}\"" ol_dump_editions_2023-08-31.txt.gz > short_ia_source.tsv
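A rough Python equivalent of the zgrep above, for anyone who would rather parse the dump than pattern-match on it. This is only a sketch: it assumes each dump line is tab-separated with the record JSON in the last column, and that source_records is a list of strings. The names short_ia_sources and scan_dump are made up for this example.

```python
import gzip
import json


def short_ia_sources(lines, max_len=6):
    """Yield (key, source) for every ia: source_record whose identifier
    part (the text after 'ia:') has at most max_len characters.

    Assumption: each line is tab-separated, record JSON in the last column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        record = json.loads(line.split("\t")[-1])
        for source in record.get("source_records", []):
            if source.startswith("ia:") and len(source) - len("ia:") <= max_len:
                yield record.get("key"), source


def scan_dump(path, max_len=6):
    """Run short_ia_sources over a gzipped dump file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        yield from short_ia_sources(dump, max_len)
```

Calling scan_dump("ol_dump_editions_2023-08-31.txt.gz") would then yield one pair per suspect source instead of the whole matching line, which makes a bare ia: with nothing after it easier to spot.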
Found one record with just ia: -- fixed manually: https://openlibrary.org/books/OL13489258M/Children%27s_Perception_of_Sarkar
All machine_comment sources have been written back into source_records, replacing the ia:ic values for the list above.
reported example: https://openlibrary.org/books/OL22347301M/Goethe's_Faust?_compare=Compare&b=5&a=4&m=diff
Code used:
#!/usr/bin/env python3
import sys

from olclient.openlibrary import OpenLibrary

ol = OpenLibrary()


def mc(olid):
    # The history list is newest-first, so the last entry is the earliest edit.
    history = ol.session.get(f'https://openlibrary.org/books/{olid}.json?m=history').json()
    return history[-1]['machine_comment']


# First argument: a file with one OLID per line.
with open(sys.argv[1]) as f:
    icia_olids = f.read().splitlines()

for olid in icia_olids:
    a = ol.get(olid)
    try:
        # list.index raises ValueError if 'ia:ic' is not present.
        i = a.source_records.index('ia:ic')
        a.source_records[i] = mc(olid)
        print(a.save('correct ia:ic source_record'))
    except ValueError:
        print(f"ia:ic not found in {olid} source_record, skipping.")
Reviewing this now, I probably should have had the code confirm that the machine_comment was in the expected marc_ithaca_college/ic_marc.mrc format before writing it, but that seems to be the case for all of them anyway. This exact code should never need to be used again, but I'm sharing it as a reference in case anything similar needs to be written. Getting the machine_comment from history isn't really explained anywhere.
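For reference, the format check mentioned above could look something like the sketch below. It is not part of the code that was actually run. It assumes the expected value is marc_ithaca_college/ic_marc.mrc followed by two colon-separated numeric fields, as in the example earlier in this thread. The names is_expected_marc_comment and corrected_source_records are hypothetical, and unlike the original script this version replaces every ia:ic entry rather than only the first.

```python
import re

# Assumed shape, based on the one example in this thread:
#   marc_ithaca_college/ic_marc.mrc:<digits>:<digits>
MARC_IC_RE = re.compile(r"marc_ithaca_college/ic_marc\.mrc:\d+:\d+")


def is_expected_marc_comment(comment):
    """Return True if comment looks like an Ithaca College MARC locator."""
    return isinstance(comment, str) and MARC_IC_RE.fullmatch(comment) is not None


def corrected_source_records(source_records, comment):
    """Return a copy of source_records with 'ia:ic' replaced by comment.

    Returns None when there is nothing to fix or the comment does not
    match the expected format, so the caller can skip the save.
    """
    if "ia:ic" not in source_records or not is_expected_marc_comment(comment):
        return None
    return [comment if s == "ia:ic" else s for s in source_records]
```

In the loop above, the save would then only happen when corrected_source_records returns a list, and anything else would be logged for a manual look.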
Some items have a bad source: value, which is throwing off import lookups when items are scanned by archive.org. We should have a MARC record, but the lookup errors out because the MARC source is not returned in the expected format. Instead it comes back in this shortened form, which omits all the specific location and offset data.
ia:ic -- these look like single item archive.org sources, but they are supposed to be bulk MARC sources.
Example: https://openlibrary.org/books/OL22347301M/Goethe%27s_Faust
The history UI correctly shows a link to the MARC source:
https://openlibrary.org/show-records/marc_ithaca_college/ic_marc.mrc:64550055:881
But this location is NOT visible in the metadata.
Apparently it is coming from the machine_comment: https://github.com/search?q=repo%3Ainternetarchive%2Fopenlibrary%20machine_comment&type=code
I'd like to see these sources replaced with the correct values from the machine comment, e.g. replacing
"source": "ia:ic"
with
"source": "marc:marc_ithaca_college/ic_marc.mrc:64550055:881"
Related files
https://github.com/search?q=repo%3Ainternetarchive%2Fopenlibrary%20machine_comment&type=code
List of OLids with this problem: ic_source.txt