internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Imports may be associating new Editions with /type/redirects Works #5206

Open seabelis opened 3 years ago

seabelis commented 3 years ago

An item on archive.org linked to this Work ID.

Details

  1. Any edition imported and attached erroneously to a /type/redirect work will not show up in search
  2. We likely need to fix our Import pipeline to check first whether a work is a /type/redirect (and this is in something like openlibrary/catalog/__init__.py)
  3. We likely need to clean up these: http://openlibrary.org/query.json?type=/type/redirect&subjects~=*&limit=1000 (this will need to be paged over, as there are more than 1,000 results and 1,000 is the query limit).
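
A minimal sketch of paging over a limited query, in Python. `fetch_page` is a hypothetical callable that wraps the actual request and returns one page of results; how the page position is passed to query.json (for example an `offset` parameter next to `limit`) is an assumption to check against the API. Results are de-duplicated by `key` as a precaution.

```python
from typing import Callable, Iterator


def iter_all_results(
    fetch_page: Callable[[int, int], list],
    limit: int = 1000,
) -> Iterator[dict]:
    """Yield every result, requesting pages until a short page comes back.

    fetch_page(offset, limit) is a hypothetical wrapper around the query;
    it is not part of the Open Library codebase.
    """
    seen = set()
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        for item in page:
            key = item.get("key")
            if key is not None and key in seen:
                continue  # skip a record already yielded from an earlier page
            seen.add(key)
            yield item
        if len(page) < limit:
            return  # a short (or empty) page means there is nothing left
        offset += limit
```

Keeping the request behind a callable means the loop can be exercised with fake data before pointing it at the live site.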

Relevant url?

https://openlibrary.org/works/OL471576W?debug=true

Steps to Reproduce

  1. Go to https://openlibrary.org/works/OL471576W
  2. Observe the error:

"We've noted the error 2021-05-25/070057688265 and will look into it as soon as possible. Head for home?"

cdrini commented 3 years ago

I fixed this specific work by reverting it to a good state. It looks like Clean up Bot associated new editions with a redirect :/ https://openlibrary.org/works/OL471576W/Asesinato_En_El_Orient_Murder_on_the_Orient_Express This should likely be merged.

cdrini commented 3 years ago

Grrr, it looks like there are a lot of these :/ http://openlibrary.org/query.json?type=/type/redirect&subjects~=*&limit=100

mekarpeles commented 3 years ago

@BharatKalluri this may be a good one for us to investigate together w/ @cdrini

LeadSongDog commented 3 years ago

So https://openlibrary.org/works/OL20890W.json returns a result, but https://openlibrary.org/works/OL20890W gives the above error. I note the json shows no hint of a title.

hornc commented 3 years ago

@seabelis I had a look at one of the results from @cdrini 's query above

https://openlibrary.org/works/OL15336690W (this page errors because of something in /booklending_utils/booklending_utils/openlibrary.py in is_exclusion with debug mode on)

The history can be viewed here: https://openlibrary.org/works/OL15336690W.json?m=history

The import API uses the search index to find edition matches for the supplied import data, so I thought maybe the search index was out of date, but it seems there are many current editions which contain this work in their metadata.

The merge was made on 2021-04-14 with v=18, but there are still many editions which refer to this work in the data dumps, not just the search index:

/type/edition   /books/OL13715125M  4   2010-08-17T23:54:37.556294  {"publishers": ["Dent"], "title": "Old Mortality", "series": ["Dent's temple series of English texts"], "created": {"type": "/type/datetime", "value": "2008-08-31T00:39:50.062937"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1907", "publish_country": "xxk", "key": "/books/OL13715125M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "edited with introduction, notes and glossary by A.J. Grieve; with numerous illustrations.", "publish_places": ["London"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL16315267M  4   2010-08-17T23:54:37.556294  {"publishers": ["Harper"], "pagination": "xvii, 441 p., [9] leaves of plates :", "title": "Old mortality", "series": ["The Waverley novels -- v. 7"], "notes": {"type": "/type/text", "value": "Includes index"}, "number_of_pages": 441, "created": {"type": "/type/datetime", "value": "2008-09-23T07:15:42.002757"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1800", "publish_country": "nyu", "key": "/books/OL16315267M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "by Sir Walter Scott", "publish_places": ["New York"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL5980885M   5   2020-09-30T20:12:01.905778  {"publishers": ["Houghton Mifflin"], "subject_place": ["Scotland"], "pagination": "xxi, 392 p.", "lc_classifications": ["PZ3.S43 O25", "PR5320.O4 O25"], "latest_revision": 5, "key": "/books/OL5980885M", "authors": [{"key": "/authors/OL75235A"}], "publish_places": ["Boston"], "contributions": ["Welsh, Alexander., ed."], "subject_time": ["1660-1688"], "genres": ["Fiction."], "source_records": ["marc:marc_loc_2016/BooksAll.2016.part06.utf8:50331756:889"], "title": "Old Mortality.", "lccn": ["66009577"], "notes": {"type": "/type/text", "value": "Bibliography: p. xxi.\n\"The Riverside edition ... follows the revised edition of 1830.\""}, "number_of_pages": 392, "created": {"type": "/type/datetime", "value": "2008-04-01T03:28:50.625462"}, "languages": [{"key": "/languages/eng"}], "subjects": ["Bothwell Bridge, Battle of, Scotland, 1679 -- Fiction.", "Scotland -- History -- 1660-1688 -- Fiction."], "publish_date": "1966", "publish_country": "mau", "last_modified": {"type": "/type/datetime", "value": "2020-09-30T20:12:01.905778"}, "series": ["Riverside editions, B 98"], "by_statement": "Edited with an introd. and notes by Alexander Welsh.", "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "revision": 5}
/type/edition   /books/OL13649816M  4   2010-08-17T23:54:37.556294  {"publishers": ["Dent", "Dutton."], "title": "Old mortality", "dewey_decimal_class": ["823.8"], "series": ["Everyman's library -- no.137"], "notes": {"type": "/type/text", "value": "1st published in Everyman's library, 1906."}, "created": {"type": "/type/datetime", "value": "2008-08-30T18:23:05.211028"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1964", "publish_country": "xxk", "key": "/books/OL13649816M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "preface and glossary by W.M. Parker.", "publish_places": ["London", "New York"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL22305660M  4   2010-08-17T23:54:37.556294  {"publishers": ["Robert Cadell", "Houlston and Stoneman"], "pagination": "3 v. :", "revision": 4, "title": "Old mortality.", "series": ["[Hinman collection]", "Waverley novels"], "notes": {"type": "/type/text", "value": "Spine title: Works of Sir Walter Scott.\n\nIncluded in volumes with: Black dwarf ; and Heart of Mid-Lothian, pt.1."}, "created": {"type": "/type/datetime", "value": "2008-11-10T01:05:48.865943"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1849", "location": ["BIN"], "key": "/books/OL22305660M", "authors": [{"key": "/authors/OL75235A"}], "latest_revision": 4, "publish_places": ["Edinburgh", "London"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "publish_country": "enk"}
/type/edition   /books/OL24350511M  14  2011-11-23T09:58:01.698919  {"other_titles": ["Presbyt\u00e9riens d'\u00c9cosse."], "publishers": ["F. Didot, fr\u00e8res"], "subtitle": "ou, Les presbyt\u00e9riens d'\u00c9cosse", "covers": [6469046], "last_modified": {"type": "/type/datetime", "value": "2011-11-23T09:58:01.698919"}, "latest_revision": 14, "key": "/books/OL24350511M", "authors": [{"key": "/authors/OL75235A"}], "ocaid": "levieillarddesto00scot", "publish_places": ["Paris"], "contributions": ["Mont\u00e9mont, Albert, 1788-1861"], "pagination": "[3], 268 p.", "source_records": ["ia:levieillarddesto00scot"], "title": "Le vieillard des tombeaux", "work_titles": ["Old mortality"], "notes": {"type": "/type/text", "value": "Sabl\u00e9 copy: pages 257-268 wanting."}, "number_of_pages": 268, "created": {"type": "/type/datetime", "value": "2010-09-01T18:24:02.721287"}, "languages": [{"key": "/languages/fre"}], "publish_date": "1835", "publish_country": "fr ", "by_statement": "par Walter Scott ; traduction nouvelle par M. Albert Mont\u00e9mont", "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "revision": 14}
/type/edition   /books/OL13909096M  4   2010-08-17T23:54:37.556294  {"publishers": ["Service and Paton"], "title": "Old mortality", "dewey_decimal_class": ["823.8"], "series": ["The illustrated English library"], "created": {"type": "/type/datetime", "value": "2008-09-02T02:13:23.078278"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1898", "publish_country": "xxk", "key": "/books/OL13909096M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "by Sir Walter Scott ; with sixteen illustrations by Sidney Paget.", "publish_places": ["London"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL7383832M   9   2010-08-17T23:54:37.556294  {"publishers": ["Oxford University Press, USA"], "languages": [{"key": "/languages/eng"}], "identifiers": {"goodreads": ["2834492"], "librarything": ["19546"]}, "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "title": "Old Mortality (Oxford World's Classics)", "contributions": ["Jane Stevenson (Editor)", "Peter Davidson (Editor)"], "number_of_pages": 561, "covers": [118678], "created": {"type": "/type/datetime", "value": "2008-04-29T13:35:46.876380"}, "isbn_13": ["9780192826305"], "isbn_10": ["0192826301"], "publish_date": "November 18, 1993", "key": "/books/OL7383832M", "authors": [{"key": "/authors/OL75235A"}], "latest_revision": 9, "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "revision": 9}
/type/edition   /books/OL13748824M  4   2010-08-17T23:54:37.556294  {"publishers": ["Adam & Charles Black"], "languages": [{"key": "/languages/eng"}], "title": "Old Mortality.", "series": ["Waverley Novels -- Vol 5"], "created": {"type": "/type/datetime", "value": "2008-08-31T16:11:35.412876"}, "edition_name": "Centenary ed.", "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1880", "publish_country": "xxk", "key": "/books/OL13748824M", "authors": [{"key": "/authors/OL75235A"}], "latest_revision": 4, "publish_places": ["Edinburgh"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "revision": 4}
/type/edition   /books/OL13767412M  4   2010-08-17T23:54:37.556294  {"publishers": ["Nimmo"], "pagination": "627p.", "title": "Old mortality", "series": ["Waverley novels -- vol.5"], "number_of_pages": 627, "created": {"type": "/type/datetime", "value": "2008-08-31T17:26:54.190561"}, "languages": [{"key": "/languages/eng"}], "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "publish_date": "1898", "publish_country": "xxk", "key": "/books/OL13767412M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "with introductory essay and notes by Andrew Lang.", "publish_places": ["London"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL16765567M  4   2010-08-17T23:54:37.556294  {"publishers": ["J.M. Dent", "E.P. Dutton"], "pagination": "xi, 454 p.", "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "title": "Old Mortality", "series": ["Everyman's library -- no. 137"], "number_of_pages": 454, "created": {"type": "/type/datetime", "value": "2008-09-25T23:31:35.725994"}, "languages": [{"key": "/languages/eng"}], "subjects": ["Covenanters -- Fiction", "Bothwell Bridge, Battle of, Scotland, 1679 -- Fiction"], "publish_date": "1906", "publish_country": "enk", "key": "/books/OL16765567M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "by Sir Walter Scott", "publish_places": ["London", "New York"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
/type/edition   /books/OL16791965M  4   2010-08-17T23:54:37.556294  {"publishers": ["Thomas Nelson and Sons"], "pagination": "xvi, 521 p.", "last_modified": {"type": "/type/datetime", "value": "2010-08-17T23:54:37.556294"}, "title": "Old Mortality", "series": ["New Century library. The works of Sir Walter Scott, bart, vol. V"], "number_of_pages": 521, "created": {"type": "/type/datetime", "value": "2008-09-26T02:41:21.429900"}, "languages": [{"key": "/languages/eng"}], "subjects": ["Covenanters -- Fiction", "Bothwell Bridge, Battle of, Scotland, 1679 -- Fiction"], "publish_date": "1906", "publish_country": "xx ", "key": "/books/OL16791965M", "authors": [{"key": "/authors/OL75235A"}], "by_statement": "by Sir Walter Scott, Bart", "publish_places": ["London, New York"], "works": [{"key": "/works/OL15336690W"}], "type": {"key": "/type/edition"}, "latest_revision": 4, "revision": 4}
....

Some of these are old, last touched by WorkBot in 2010.

It looks like the merge code is not tidying up all editions and leaving dangling references to the redirect works.
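
A rough sketch of how the dangling references could be listed from a data dump, in Python. It assumes each dump line has five tab-separated fields (type, key, revision, last modified, JSON), as the excerpt above suggests; the function name is made up for illustration.

```python
import json
from typing import Iterable, Iterator


def editions_pointing_at(dump_lines: Iterable[str], work_key: str) -> Iterator[str]:
    """Yield the keys of edition records whose `works` list references work_key.

    Assumes the five-column, tab-separated dump layout; lines that do not
    match that layout or are not editions are skipped.
    """
    for line in dump_lines:
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) != 5 or fields[0] != "/type/edition":
            continue
        record = json.loads(fields[4])
        if any(w.get("key") == work_key for w in record.get("works", [])):
            yield fields[1]
```

Run against the editions dump with `/works/OL15336690W`, this should list records like the ones quoted above.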

Previously, it was a safe assumption that all work IDs in editions were real works, not redirects, and that redirects were for works with no linked editions. Infogami doesn't help things by having no built-in way to handle redirects transparently and follow them automatically.
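
Since redirects are not followed automatically, a caller such as the import matcher would have to resolve them itself. A minimal sketch in Python, assuming a redirect record looks like `{"type": {"key": "/type/redirect"}, "location": "<target key>"}` and that `get_doc` is a hypothetical lookup returning a record dict or `None`; neither name comes from the Open Library code.

```python
from typing import Callable, Optional

REDIRECT_TYPE = "/type/redirect"


def resolve_work_key(
    get_doc: Callable[[str], Optional[dict]],
    key: str,
    max_hops: int = 10,
) -> str:
    """Follow /type/redirect records until a non-redirect record is reached."""
    seen = set()
    for _ in range(max_hops + 1):
        if key in seen:
            raise ValueError(f"redirect loop at {key}")
        seen.add(key)
        doc = get_doc(key)
        if doc is None:
            raise KeyError(key)
        if doc.get("type", {}).get("key") != REDIRECT_TYPE:
            return key  # a real record: stop here
        location = doc.get("location")
        if not location:
            raise ValueError(f"redirect {key} has no location")
        key = location
    raise ValueError(f"more than {max_hops} redirects starting from the given key")
```

An import step could call this on the matched work key before attaching a new edition, so the edition lands on the target work rather than on the redirect.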

LeadSongDog commented 3 years ago

It is not helpful to have multiple author records for Sir Walter Scott. It is even worse to have edition records explicitly and indelibly linking to the wrong author record, different from the one given in the work record. This is a well known issue: see #2625 and #5265. Edition records should not show author links unless they can stay in sync with the work records.

seabelis commented 3 years ago

@hornc The merge script only migrates 50 editions. In cases where there are more than 50, I usually use a separate script to first migrate the editions before running the merge; looks like maybe I missed this one. I've seen this before with some older merges by bots and was able to reverse them by rolling back to an earlier version of the work, but that does not seem possible in this case. Related to https://github.com/internetarchive/openlibrary/issues/1676
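
For illustration only, the "migrate the editions first" step could be sketched like this in Python. This is not the script mentioned above: the helper names are hypothetical, and the batch size of 50 simply mirrors the limit described here.

```python
from typing import Iterator, Sequence


def batches(items: Sequence[dict], size: int = 50) -> Iterator[Sequence[dict]]:
    """Split a list of edition records into chunks of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def repoint(edition: dict, old_work: str, new_work: str) -> dict:
    """Return a copy of the edition with old_work replaced by new_work."""
    works = [
        {"key": new_work} if w.get("key") == old_work else w
        for w in edition.get("works", [])
    ]
    return {**edition, "works": works}
```

Each batch of repointed editions would then be saved in one go; how the save is done is left out because it depends on the client being used.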

I'm pretty sure I've seen cases where after reverting the work, the only edition(s) was the new import. It's hard to search for these examples, but may explain why, when sometimes searching for an ISBN, I get an error message instead of the usual "No results found." I was told that this error was just an un-graceful way of saying "No results found", but maybe that isn't actually the case.

I'll post the search next time I encounter it.

seabelis commented 3 years ago

Was able to revert https://openlibrary.org/works/OL15336690W by editing the .yml

seabelis commented 3 years ago

This merge happened way back in 2010. https://openlibrary.org/works/OL10614095W/Lone_Eagle?m=history

seabelis commented 3 years ago

And now, trying to re-merge, I get an error: "Unexpected token U in JSON at position 0". This work did not have 50 editions, or if it did, they did not migrate, so it's unclear why this merge still left editions remaining on the dupe.

I think the error is probably due to a redirected author ID associated with one of the editions.

seabelis commented 3 years ago

And as @LeadSongDog points out, all of these seem to be missing titles.

seabelis commented 3 years ago

And again, this one was merged back in 2011, but with remaining editions. Same error when trying to merge now. https://openlibrary.org/works/OL10614091W/untitled

Another fails when trying to redirect. https://openlibrary.org/works/OL10614096W/Mixed_blessings?m=history

seabelis commented 3 years ago

Also, none of those in @cdrini 's list are actually redirecting until they are reverted and re-merged. So something is going wrong there.

seabelis commented 3 years ago

So I think these can all be reverted by changing the .yml from type/redirect to type/work and removing the location. Can this be done in bulk? @hornc @cdrini @BharatKalluri

In some cases, the author ID also throws an error, but those would have to be updated according to the specific author IDs.

cdrini commented 2 years ago

@seabelis I think that's a good approach; they'll also need a title field restored. This shouldn't be too hard to run...

cdrini commented 2 years ago

This query I posted is a little problematic: http://openlibrary.org/query.json?type=/type/redirect&subjects~=*&limit=100 . It seems to duplicate results for some reason. Let me do a count to see how many there actually are.

curl -L 'https://archive.org/download/ol_dump_2022-03-02/ol_dump_redirects_2022-03-02.txt.gz' | zcat | grep -F 'subjects' | wc -l

301 matches! Not too bad

hornc commented 2 years ago

@seabelis wouldn't it be easier to find the last version before the redirect and revert to that?

You can get to the history page UI by appending ?m=history e.g.

https://openlibrary.org/works/OL69612W?m=history
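
A small sketch of "find the last version before the redirect", in Python. It assumes revisions are numbered from 1 up to the latest and that each one can be fetched individually (elsewhere in this thread a record is addressed as `.json?v=10`); `get_revision` is a hypothetical callable doing that fetch.

```python
from typing import Callable, Optional

REDIRECT_TYPE = "/type/redirect"


def last_non_redirect_revision(
    get_revision: Callable[[int], dict],
    latest: int,
) -> Optional[int]:
    """Walk back from the latest revision to the newest one that is not a redirect."""
    for rev in range(latest, 0, -1):
        doc = get_revision(rev)
        if doc.get("type", {}).get("key") != REDIRECT_TYPE:
            return rev
    return None  # every revision is a redirect
```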

The root issue here seems to be that when the new import matching process is running, stale results are being returned and identifiers are being matched on old cached or index items that have since been turned into redirects.

edit: I re-read my own comment above and the cause is the import matching process is picking up the redirected work from existing editions which reference it, so not all editions were properly updated during the merge.

It looks like the problem which needs solving is: re-merging partially merged works

i.e. a work which is a redirect and has many editions pointing to it needs to be merged (again) with another work record.

Does that help define the problem? Can the existing merge works tool be modified to merge these kinds of works too?

cdrini commented 2 years ago

That's 100% part of it! We also don't want to lose the edits ImportBot made to the records, so I think what might be best is to "soft" revert setting just the title and removing type/redirect, then create a big list of links to the Merge UI for librarians to merge these again. Although note it currently doesn't support merging a work with 50+ editions into another work.
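
Roughly, the "soft" revert amounts to the following Python sketch. The function name is made up, the record layout (`type` as `{"key": ...}`, a `location` field on redirects) is assumed from the records discussed in this thread, and the title is expected to come from an earlier, pre-redirect revision.

```python
import copy


def soft_revert(redirect_doc: dict, title: str) -> dict:
    """Turn a redirect record back into a work without discarding other fields.

    Only `type`, `location` and `title` are touched, so later edits made to
    the record (e.g. added subjects) are preserved.
    """
    doc = copy.deepcopy(redirect_doc)
    doc["type"] = {"key": "/type/work"}
    doc.pop("location", None)
    doc["title"] = title
    return doc
```

Working on a copy leaves the fetched record untouched, which makes it easy to diff before and after saving.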

Put together a colab of what I think should do the trick! https://colab.research.google.com/drive/1FcGV3CafYKgBQ4848pfv_ru31f3OYiMz#scrollTo=8Pha5F5BVuyj

cdrini commented 2 years ago

Alright, kicking off a sample of that job; the number of works affected is <300. I think some of these are residues from "Open Library Work Bot" in the early 2010s :) https://colab.research.google.com/drive/1FcGV3CafYKgBQ4848pfv_ru31f3OYiMz#scrollTo=hOIREPs9a5rC

So I'm only going to be "soft reverting" work keys that have editions associated with them :+1: That's 189 works

cdrini commented 2 years ago

Alright completed! Here are the results: https://docs.google.com/spreadsheets/d/1Qu7tlmyaQPUib-GIQBCuAPRboKzZpU2TsTZ6xMPkQ-Y/edit#gid=0

cdrini commented 4 months ago

It seems like this also caused trouble with author records: https://openlibrary.org/authors/OL28127A.json?v=10