internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

malformed genre entries (example for scifi), break search cta from homepage #4483

Open jzacsh opened 3 years ago

jzacsh commented 3 years ago

About 470 works are incorrectly listed under a single nonsense genre, "Fiction, science fiction, space opera" (those are three different genres). See: https://openlibrary.org/subjects/fiction_science_fiction_space_opera

Evidence / Screenshot (if possible)

Screenshot 2021-01-24 at 13 57 38

Relevant url?

https://openlibrary.org/subjects/fiction_science_fiction_space_opera

Steps to Reproduce

  1. visit home page
  2. click science fiction [14,712 books] icon in the big "browse by subject" banner
  3. search for an author you know writes scifi, eg "Nnedi Okorafor"
  4. bug: no results - start debugging, since you're surprised:

Note: the same is true of Consider Phlebas, but as you can see, there it is overcome by a correct genre having been added beside the nonsense genre (so the book shows up in facet searches as expected from the home page).

Expected/Actual

root cause of the bug is step 4f being wrong (and this is true of many books; see 4g):

Details

Proposal & Constraints

  1. instance fix: mass-fix of all books in this category (update them all to delete this category and insert 3 new categories of Fiction, science fiction, space opera)
    • 1a) look for other instances (eg: maybe do a database search for all works with a subject field containing quotes?) and fix those just the same.
  2. systemic fix: I'd guess it's unlikely all 400 books got this bug from a single user's bad import csv - my guess is there's something broader perpetuating the bug (eg: an autocomplete that other users click in a dropdown, not knowing the selection is malformed data).
    • the systemic fix would be to make both the import logic and the single-entry form validation warn about these kinds of errors (commas nested inside quotes) and help mitigate them: either automatically strip the quotes or help users determine whether an internal comma is really intended.
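
A minimal sketch of the validation idea in proposal 2, assuming the subject entry field is a comma-delimited string in which double quotes can protect an internal comma. The function names `parse_subject_field` and `suspicious_subjects` are hypothetical and are not taken from the Open Library codebase; only Python's standard `csv` module is used.

```python
import csv
import io


def parse_subject_field(raw: str) -> list[str]:
    """Split a comma-delimited subject entry, honouring double quotes."""
    # skipinitialspace lets a quoted subject follow ", " rather than only ","
    reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
    row = next(reader, [])
    return [s.strip() for s in row if s.strip()]


def suspicious_subjects(subjects: list[str]) -> list[str]:
    """Return the subjects that still contain a comma after parsing."""
    return [s for s in subjects if "," in s]


# Hypothetical form input: one quoted subject hides two internal commas.
entry = 'Fantasy, "Fiction, science fiction, space opera", Africa'
parsed = parse_subject_field(entry)
flagged = suspicious_subjects(parsed)
```

Here `flagged` holds the one subject a form or import step could warn about, so the user can confirm whether the internal commas are really intended before anything is saved.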

seabelis commented 2 weeks ago

Many subject tags are imported with the record and not in our direct control. There are over 2,000 items with this tag alone. This is too many to fix manually. Maybe someone would like to try to remove this programmatically?

jzacsh commented 2 weeks ago

@seabelis how about proposal 1a, or 2: do those sound possible to you?

While I'm not available to work on this any time soon, I'd guess the most useful help for the next contributor would be pointers to where solutions 1, 1a, and 2 might plausibly start in this codebase, or which proposals the team prefers or dislikes.


> Maybe someone would like to try to remove this programmatically?

Edit: Also I should point out that proposal 1 doesn't have to be a mass removal (in fact that might leave the buggy search experience still intact for many books), but could be a re-insertion of the intended/fixed values.
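
To make the re-insertion idea concrete, here is a rough sketch, assuming a work record is a plain dict with a `subjects` list of strings. The helper `fix_subjects`, the example record, and the exact capitalisation of the three intended values are all assumptions for illustration, not the project's actual API or data.

```python
MALFORMED = "Fiction, science fiction, space opera"
# Assumed spelling of the three intended subjects.
INTENDED = ["Fiction", "Science fiction", "Space opera"]


def fix_subjects(subjects, malformed=MALFORMED, intended=INTENDED):
    """Return a new list with `malformed` replaced by `intended`, without duplicates."""
    fixed = []
    seen = set()
    for subject in subjects:
        replacements = intended if subject == malformed else [subject]
        for item in replacements:
            key = item.casefold()  # compare case-insensitively
            if key not in seen:
                seen.add(key)
                fixed.append(item)
    return fixed


# Hypothetical record that already carries one correct subject.
work = {"title": "Example work", "subjects": ["Science fiction", MALFORMED]}
work["subjects"] = fix_subjects(work["subjects"])
```

Replacing rather than deleting keeps each affected work discoverable under the intended subjects, and the duplicate check avoids adding a subject that a work already has.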

jzacsh commented 2 weeks ago

Oh, interestingly: #7904 seems to be a newer (2 years later) rethink of the data structure involved here. I'd guess it's important that whatever is proposed here be coordinated closely with those folks.

tfmorris commented 2 days ago

This subject was imported from Better World Books, which is infamous for providing garbage metadata, but we've been unable to convince the powers that be to stop importing from it. Obviously, having subjects with embedded commas is incompatible with using commas as the delimiter in the subject data entry field, so they would need to be escaped in some way. However, it is likely that this was originally intended to be the hierarchical genre "Fiction / Science Fiction / Space Opera", as you can see from the Library of Congress hierarchy here: https://id.loc.gov/authorities/genreForms/gf2014026551.html You can also see it in textual form, rather than as "broader" links, at the bottom of this MARC record: https://openlibrary.org/show-records/marc_loc_2016/BooksAll.2016.part41.utf8:166410102:1707

You can see all the different ways that "space opera" is spelled on OpenLibrary with different hierarchy delimiters here: https://openlibrary.org/search/subjects?q=space+opera
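
As a rough illustration of grouping those spelling variants, the sketch below reduces a subject string to a case-insensitive hierarchy path. The delimiter set (`--`, `/`, `,`, `;`), the function name `subject_path`, and the third sample string are assumptions for illustration; real data may use other delimiters, and treating every comma as a hierarchy break would be wrong for subjects with a genuine internal comma.

```python
import re

# Delimiters assumed to separate hierarchy levels; the real data may use others.
_DELIMITERS = re.compile(r"\s*(?:--|/|,|;)\s*")


def subject_path(subject: str) -> tuple[str, ...]:
    """Split a subject on the assumed delimiters and case-fold each level."""
    parts = _DELIMITERS.split(subject.strip())
    return tuple(p.casefold() for p in parts if p)


variants = [
    "Fiction, science fiction, space opera",
    "Fiction / Science Fiction / Space Opera",
    "FICTION -- Science Fiction -- Space Opera",  # hypothetical variant
]
# All three variants collapse to the same path.
paths = {subject_path(v) for v in variants}
```

A key like this could be used to cluster variants for review before any bulk edit, rather than to rewrite subjects automatically.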

My feature request (#2819) to make subjects first-class objects instead of strings was an attempt to bring some order to this, as well as to allow links to things like LCSH, FAST, and Wikidata. It would also support internationalization for things like Novelas del espacio.

The best fix would be to stop importing from BWB, but failing that, all the bad metadata should be filtered out (which is probably effectively the same thing).