Closed bencomp closed 3 years ago
The numbers for the September 30, 2013 dump are 5.3 million editions (21%) without a link to a work. That's a lot! What do we need to make progress on this?
The example that I gave in my email http://openlibrary.org/books/OL5204939M/The_complete_old_English_sheepdog still has no work record so I don't think the situation has improved since then.
We also apparently have works without editions as described in issue #44
Editing a work-less edition now automatically creates a work for it on the fly, so this will get cleaned up if a user touches them, but as of October 31, 2015, there were still 5.2 million workless editions from a total of 25.1 million (and 170 editions had two works linked!). Since May 2012, the percentage has dropped from 21.7% to 20.9%.
I think search has been changed so that these at least show up on authors' pages and in search results, but I'm not 100% certain.
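For anyone wanting to re-check these counts against a newer dump, here is a minimal sketch. It assumes the editions dump is tab-separated with the columns type, key, revision, last_modified, JSON, and that a linked edition carries a non-empty "works" list; the function name and the sample rows are made up for illustration.

```python
import json

def count_editions(dump_lines):
    """Return (workless, multi_work, total) for the edition rows in a dump."""
    workless = multi_work = total = 0
    for line in dump_lines:
        # Assumed row layout: type, key, revision, last_modified, JSON.
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5 or parts[0] != "/type/edition":
            continue  # skip works, authors and malformed rows
        record = json.loads(parts[4])
        works = record.get("works") or []
        total += 1
        if not works:
            workless += 1
        elif len(works) > 1:
            multi_work += 1
    return workless, multi_work, total

# Made-up rows, only to show the expected shape of the input.
sample = [
    '/type/edition\t/books/OL1M\t1\t2013-09-30T00:00:00\t{"works": [{"key": "/works/OL1W"}]}',
    '/type/edition\t/books/OL2M\t1\t2013-09-30T00:00:00\t{"title": "No work here"}',
    '/type/edition\t/books/OL3M\t1\t2013-09-30T00:00:00\t{"works": [{"key": "/works/OL1W"}, {"key": "/works/OL2W"}]}',
    '/type/work\t/works/OL1W\t1\t2013-09-30T00:00:00\t{"title": "A work"}',
]

workless, multi_work, total = count_editions(sample)
print(f"{workless} of {total} editions have no work ({workless / total:.1%}); "
      f"{multi_work} link to more than one work")
```

For a real dump, the same function could be fed lines from `gzip.open(path, "rt", encoding="utf-8")` so the file is streamed rather than loaded into memory.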
@jessamynwest: I'd love your feedback on the possible implications of auto-creating works for the 21% of editions that lack one. (I'm particularly wary of the possibility of creating loads of duplicate works, so it would be worth our while to spend some time coming up with an intelligent algorithm.)
Anyone else can chime in too!
Another example at: https://openlibrary.org/books/ia:bub_gb_otUYAAAAYAAJ/International_exchange_list_of_the_Smithsonian_Institution_corrected_to_July_1897
Over 200,000 Google Books records uploaded by this user may be similar and need correction.
The Jira has a pretty good workflow for how the process could work. We have a user who is on the mailing lists who suggested a programmatic solution. If we're sticking with the Work/Edition model (and I think we absolutely should) I think it's a no-brainer to do this. The side effect should also be that all the records on OL become editable. Right now the workless editions that were auto-imported aren't editable, which is not cool from a UX perspective.
The example above, https://openlibrary.org/books/ia:bub_gb_otUYAAAAYAAJ , has an OL work page for (what I think is) the same work + edition here: https://openlibrary.org/works/OL13807452W , which links back to a scan of a different copy of the "corrected to July 1897" edition on IA: https://archive.org/details/internationalex01instgoog One has a Harvard bookplate, the other is from the University of Michigan.
Both ia:xxxx workless editions I checked recently have happened to have other OL entries created from different IA imports.
I think that means it is safe to delete those specific ia:xxxx entries from OL, but the duplicate IA details pages should be updated to point back to the one OL edition page?
I have no idea how many of the other ia:xxxx entries will turn out to be dupes too. Is there a way to get a list of them? The user shows up for me as anonymous. If I could get a full list I could investigate further.
The ia:xxxx editions can be edited by modifying the URL by tacking ?m=edit to the end, e.g.: https://openlibrary.org/books/ia:bub_gb_otUYAAAAYAAJ?m=edit
I have not tried making a change to see if the auto work creation happens, because I believe the two examples I looked at should be deleted rather than have works added, but I am very willing to be guided on the correct action here :)
see also https://github.com/internetarchive/openlibrary/issues/321
That's a really interesting point. If these items are editable, why do they not have an EDIT button? Maybe there's a script-y way to at least do that and then get them into the larger queue of "Things that need deduplicating".
Per tfmorris, "Editing a work-less edition now automatically creates a work for it on the fly": this is more a problem than a solution, because it creates near-duplicate works. What should happen instead is that a fuzzy search based on the title and author runs to identify existing works similar to the edition, then the editor is given a picklist of the nearest-matching work records (plus a "new work" option). The editor's choice is then linked to the edition record.
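To make the picklist idea concrete, here is a rough sketch using only Python's standard library. It is not how Open Library matches records: the 0.7/0.3 weighting, the 0.6 threshold, the flat title/author fields and the work keys are all assumptions chosen for illustration, and a real implementation would more likely query the search index than compare strings pairwise.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def work_picklist(edition, works, limit=5, threshold=0.6):
    """Rank candidate works for an orphan edition; always offer "new work" last."""
    scored = []
    for work in works:
        # Assumed weighting: title matters more than author spelling.
        score = (0.7 * similarity(edition["title"], work["title"])
                 + 0.3 * similarity(edition["author"], work["author"]))
        if score >= threshold:
            scored.append((score, work["key"]))
    scored.sort(reverse=True)
    return [key for _, key in scored[:limit]] + ["new work"]

# Hypothetical records; the keys are placeholders, not real OL identifiers.
edition = {"title": "Hong lou meng", "author": "Cao Xueqin"}
works = [
    {"key": "/works/OL1W", "title": "Hong lou meng", "author": "Cao Xueqin"},
    {"key": "/works/OL2W", "title": "Hung lou meng", "author": "Cao, Xueqin"},
    {"key": "/works/OL3W", "title": "Romeo and Juliet", "author": "William Shakespeare"},
]
print(work_picklist(edition, works))
```

Keeping "new work" as an explicit last option means the editor, not the script, decides when nothing in the list is a real match.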
Stranger and stranger. https://openlibrary.org/works/OL23966006M/Hong_lou_meng is apparently redirected to https://openlibrary.org/books/OL23966006M/Hong_lou_meng but this shows as an edition with no work (and hence no author). Of course, that text has many work-records. The main one seems to be https://openlibrary.org/works/OL16280308W/Hong_lou_meng which shows many editions, but other work records show only 1-3 editions. Of course these should all be merged to a single work, but how are they arising to begin with? Should the "Add a book" process at https://openlibrary.org/books/add not be more aggressive in looking for existing works before creating a new one?
@LeadSongDog: Yeah, we've got to tackle this whole dupe issue in a more comprehensive way. Yes, we should definitely be more aggressive in looking for existing works. We're preparing for a big import of new books, and for these, we are making sure that there is no overlap of ISBNs -- or even associated ISBNs -- using OCLC's xISBN service, and no title matches (as found by OL's Solr search).
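As a small illustration of the ISBN-overlap part of that check, the sketch below normalizes ISBN-10s and ISBN-13s to a common ISBN-13 form before comparing the two sets. It only catches the same ISBN written two ways; the "associated ISBNs" step that relies on xISBN is not modelled, and the function names are hypothetical.

```python
def to_isbn13(isbn):
    """Normalize an ISBN-10 or ISBN-13 string to a bare 13-digit ISBN."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13:
        return digits
    if len(digits) != 10:
        raise ValueError(f"not an ISBN-10 or ISBN-13: {isbn!r}")
    # Drop the ISBN-10 check character, add the 978 prefix, recompute the check digit.
    core = "978" + digits[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def isbn_overlap(incoming, existing):
    """Return the ISBN-13s that appear in both collections."""
    return {to_isbn13(i) for i in incoming} & {to_isbn13(i) for i in existing}

# The same book identified once by ISBN-10 and once by ISBN-13.
print(isbn_overlap(["0-306-40615-2"], ["9780306406157", "9780140449136"]))
```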
@bfalling It is, after all, a commonly overlooked key to life management: "When you find yourself in a hole, stop digging." There's a considerable overlap between the problems of redundant authors and redundant works. The cuteness of the "add a book" functionality is perhaps part of the issue: it should not invoke creation of the primary author record, but rather should link to an author record that already exists. The way to ensure that is to insist the user start by choosing that author before adding anything else to the work record.
Under OL5991242A there are presently 26 edition records masquerading as works, e.g. https://openlibrary.org/works/OL20482081M/The_Edinburgh_Review_Or_Critical_Journal Note the M type record in the /works/ path. Note also that no author is identified.
All should have been created under an existing work, such as https://openlibrary.org/works/OL13100980W/The_Edinburgh_Review_Or_Critical_Journal For long-running serials such as this, editors change over time. There should be a clean way to handle that circumstance, for instance by use of "Edinburgh review editors" as the author entry.
Here's an interesting case. https://openlibrary.org/show-records/marc_university_of_toronto/uoft.marc:1015940114:568 was used to create an orphaned edition record https://openlibrary.org/books/OL16898832M/The_American_Scenic_and_Historic_Preservation_Society which evidently was never linked with https://openlibrary.org/works/OL13809514W/The_American_scenic_and_historic_preservation_society.. (or indeed any other work-record) by https://openlibrary.org/authors/OL1816576A/American_Scenic_and_Historic_Preservation_Society. As that MARC record had a variant publisher spelling, https://openlibrary.org/publishers/Amer._Scenic_and_Hist._Preserv._Soc. was also created, but without any work-record, that too shows as an orphan, with no linked work.
Would it not be simple to have the "add an edition" functionality accept an OLnnnnM identifier for an existing edition record?
@LeadSongDog, there is a hack to achieve that if you are an admin. Go to /books/OLxxxM.yml?m=edit and add a works entry.
@bencomp I remember fixing this issue without creating work records. It is possible to add the editions without a work to the search engine so that they appear in search results. IIRC they will also appear on the author pages.
I don't think creating a real work record for each such entry is a good idea.
@anandology Thanks, but that's not really useful for something this common. It's going to have to be easy for any user, if not bot driven.
Can someone please explain why we would ever want editions without works? I thought the whole relationship of author-to-work-to-edition was an absolute requirement.
I think every edition should have a work, at least under the current schema we have for representing books.
@anandology: Just curious, why don't you think creating a work record for each "orphan" edition is a good idea?
@bfalling If such were to be created, they would include many duplicate or multiple records for editions of the same work. These would then have to be de-duplicated afterwards. Searching first for existing work records avoids this problem.
Sorry, my fault. I think I'm creating or contributing to a misunderstanding.
@LeadSongDog: Yes, I totally agree that we should avoid creating duplicate Works. My inclination would be to attach an orphan Edition to the appropriate parent Work, if one exists and can be found, and only create a new parent Work otherwise.
@anandology: Are you saying that orphan Editions shouldn't need to have an associated Work? Or simply echoing the warning not to automatically create new Works for orphaned Editions when appropriate parent Works might already exist? Or something else?
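The attach-first, create-only-as-a-fallback preference described above can be sketched as follows. The lookup and creation steps are passed in as functions because how they would really work is exactly what this thread is debating; `find_existing_work`, `create_work`, the in-memory catalog and the keys are all hypothetical. The only thing assumed about the record itself is that an edition links to its work through a "works" list of {"key": ...} entries.

```python
def link_orphan(edition, find_existing_work, create_work):
    """Return a copy of the edition that is linked to exactly one work."""
    if edition.get("works"):
        return edition  # already linked, nothing to do
    # Prefer an existing parent work; create a new one only if none is found.
    work_key = find_existing_work(edition) or create_work(edition)
    return {**edition, "works": [{"key": work_key}]}

# Toy stand-ins for the real lookup and creation steps.
catalog = {("hong lou meng", "/authors/OL1A"): "/works/OL1W"}
created = []

def find_existing_work(edition):
    author_key = edition["authors"][0]["key"]
    return catalog.get((edition["title"].lower(), author_key))

def create_work(edition):
    key = f"/works/OL{1000 + len(created)}W"  # placeholder key allocation
    created.append(key)
    return key

known = {"key": "/books/OL1M", "title": "Hong lou meng",
         "authors": [{"key": "/authors/OL1A"}]}
unknown = {"key": "/books/OL2M", "title": "An unmatched title",
           "authors": [{"key": "/authors/OL2A"}]}

print(link_orphan(known, find_existing_work, create_work)["works"])
print(link_orphan(unknown, find_existing_work, create_work)["works"])
```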
@bfalling, @jessamynwest : It would help if it were possible, when editing the identifier list for such a solitary edition, to add or change the OLxxxxxW work identifier to one that already exists.
cc: @hornc
Treating this as a high level task which entails many different parts. Other subtasks should be referenced as children of this one.
The first task was to create Works for all those orphaned Editions which had archive.org ebooks/scans associated with them (i.e. readable / borrowable / actionable catalog items). As far as I remember, this is complete :)
In 2018 Q1, our goal is to modify how Open Library serves book pages -- a la https://github.com/internetarchive/openlibrary/issues/684 -- in a way which will require all edition pages to have works in order to be accessed.
Flagging @hornc as the person who has been valiantly leading these efforts
thanks @mekarpeles. I have a plan page on the wiki here https://github.com/internetarchive/openlibrary/wiki/Orphaned-Editions-Planning that is mostly current, but needs a review and new push for 2018.
Ok, #346 has been closed as a partial dupe of this, but we are still seeing these crazy /works/OLnnnnM/ filepaths, such as the first twenty shown at https://openlibrary.org/authors/OL3528140A/ (the last three are /works/OLnnnnW/ as expected, but they have a different problem: no editions linked!).
Let's please use this as a high-level issue for organizing [checklist of] all "create works for workless editions" issues.
I have updated https://github.com/internetarchive/openlibrary/wiki/Orphaned-Editions-Planning for 2018, and I believe there is a way forward to fix this for good using the re-import process I have been working on.
I'd like to see this issue closed soon. It's been slow progress, but I think we have the pieces in place to reimport and fix these at scale.
There are some placeholders in the original text for the wiki page about detailed stats for the source of the bad imports. I think it would be useful to have a stratified analysis of the errors to know where to focus the effort. Perhaps there was an offline analysis that informed the current "reimport stuff" decision, but that'd be useful to see (as well as what percentage of the problem it's likely to address).
I suspect that not all the workless editions are due to import issues, but we should also make sure that we reimport anything that is import related, e.g. bad characters in titles https://openlibrary.org/books/OL13507123M/Madame_de_Sta%C2%A9%C2%B7el_et_Napol%C2%A9%C3%98eon
@mekarpeles Is this umbrella issue to cover #561, #1586 and #1808 too? There's really little excuse for the ongoing chaos in handling serials, series, and anthology editions.
@hornc how are we doing on orphaned editions, in general? Any updates to this ticket?
Is this umbrella issue to cover #561, #1586 and #1808 too?
No.
This issue feels fairly critical to the mission of Open Library, so I'm labeling with a high priority. @hornc Let me know if you agree or disagree
Solving this issue will solve #1075
More examples of malformed paths found among the listings under https://openlibrary.org/authors/OL858999A/David_Burnett_undifferentiated:
https://openlibrary.org/works/OL13350811M https://openlibrary.org/works/OL13350814M https://openlibrary.org/works/OL13350821M https://openlibrary.org/works/OL13350834M https://openlibrary.org/works/OL9208045M https://openlibrary.org/works/OL9208042M https://openlibrary.org/works/OL13350825M https://openlibrary.org/works/OL13350819M and https://openlibrary.org/works/OL11835283M
Curiously, all these seem to resolve to the intended target edition
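A quick way to flag these malformed paths in a list of links is to compare the path section with the identifier suffix. The sketch below assumes the convention seen throughout this thread (W for works, M for editions under /books/, A for authors); the function names are made up.

```python
import re

# Section that each identifier suffix is expected to live under.
EXPECTED_SECTION = {"W": "works", "M": "books", "A": "authors"}
PATH_RE = re.compile(r"^/(works|books|authors)/(OL\d+([WMA]))(?:/|$)")

def canonical_path(path):
    """Return the path the identifier suffix implies, or None if unparseable."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    olid, suffix = match.group(2), match.group(3)
    return f"/{EXPECTED_SECTION[suffix]}/{olid}"

def is_malformed(path):
    """True when the section does not agree with the suffix, e.g. /works/OL...M."""
    match = PATH_RE.match(path)
    return match is not None and match.group(1) != EXPECTED_SECTION[match.group(3)]

for path in ["/works/OL13350811M", "/works/OL13100980W", "/books/OL16898832M"]:
    print(path, is_malformed(path), canonical_path(path))
```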
@hornc got this number down to ~2M orphans and most of the egregious cases have been dealt with: https://github.com/internetarchive/openlibrary/wiki/Orphaned-Editions-Planning.
As it stands, whenever an orphaned edition is edited, a work is created for it.
Going to close this issue as things should become eventually consistent. If someone wants to create a bot (see https://github.com/internetarchive/openlibrary-client and https://github.com/internetarchive/openlibrary-bots) to seek out orphaned editions and touch them / create works, that wouldn't be a terrible idea.
In May 2012, about 21.7% of the Edition records had no link to a Work. This was discussed on the tech list: http://www.mail-archive.com/ol-tech@archive.org/msg00624.html
As Editions without Work are treated differently than Editions with Work, I believe this needs to be solved.
Editions without Work:
Setting up the search engine to include Editions is already an issue (#114), but that does not change the fact that, for example, authors can be recorded in the Work record, in the Edition record, or in both, and that if they are in both, the Edition may not be updated during an Author merge. It's a hassle.
Ideally, new Works are only created when there is no existing Work yet. (We don't need, say, another twenty Romeo and Juliet Work records.) If there is a Work, the Edition should link to that. Realistically, it is hard to find matching Works for every Edition. A Work-merger may be able to do that later. We could just run one script that creates a new Work for every Work-less Edition.