Provide endpoint to allow librarians to view solr index

internetarchive / openlibrary

One webpage for every book ever published!

https://openlibrary.org

GNU Affero General Public License v3.0

Provide endpoint to allow librarians to view solr index #2746

Open LeadSongDog opened 4 years ago

LeadSongDog commented 4 years ago

When changing a record, there's currently no apparent way to know what will make it into the Solr index. This leaves one guessing what effect to expect.

Describe the solution you'd like

On saving, show not just the edited record, but also what is being sent to Solr.

cdrini commented 4 years ago

That's a really good idea; it would be really helpful in debugging! I'd recommend making a "secret" endpoint instead (maybe https://openlibrary.org/works/OL1100007W/Le_tour_du_monde_en_quatre-vingts_jours/_solr_record ). Putting it directly in the UI might be a bit too much, since it's not permanent data (unlike most of the wiki items, which have e.g. a history) and not "directly" editable. The endpoint could return the corresponding Solr record for that item, i.e. the result of a query like: http://server.openjournal.foundation:8984/solr/select/?q=key%3A%2Fworks%2FOL1100007W&version=2.2&start=0&rows=10&indent=on
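For illustration only, here is a minimal sketch of how such a lookup URL could be built for any record key. The function name `solr_select_url` and the base URL `http://localhost:8983/solr` are assumptions made up for this example, not part of the Open Library codebase; only the `q=key:...`, `start`, `rows`, and `indent` parameters come from the query shown above.

```python
from urllib.parse import urlencode


def solr_select_url(solr_base: str, key: str, rows: int = 10) -> str:
    """Build a Solr select URL that looks up one document by its key.

    `solr_base` is a hypothetical Solr root such as "http://localhost:8983/solr".
    """
    params = {
        "q": f"key:{key}",  # same query shape as the example URL above
        "start": 0,
        "rows": rows,
        "indent": "on",
    }
    # urlencode percent-encodes ":" and "/" in the key for us.
    return f"{solr_base.rstrip('/')}/select/?{urlencode(params)}"


url = solr_select_url("http://localhost:8983/solr", "/works/OL1100007W")
print(url)
```

A `_solr_record` endpoint along these lines would presumably run this query server-side and return the matching document, so that librarians never need direct access to the Solr host.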

LeadSongDog commented 4 years ago

Thank you @cdrini, that's the first time I've ever actually seen a Solr record. It is a very instructive example that helps to compare the information on different editions, highlighting some anomalies and old annoyances, such as:

one ISBN value of "$2.30"
audiobooks show publisher as "Brilliance Audio on CD" and "Brilliance Audio on cassette" instead of just "Brilliance Audio"
many contributor names are still in inverted form, such as "Moser, Barry, ill."
some contributors are actually publishers, such as the "Limited Editions Club"
many fake subjects are still present such as "In library", "Popular Print Disabled Books", "Large type books", "Protected DAISY", "Internet Archive Wishlist", "Translations from French" and "Translations into ..."
publication places are still commonly mal-abbreviated, such as "Me" for "Maine", "Pa" for "Pennsylvania", and "N.Y" for "New York"
many edition titles are still including parenthetic series or edition names
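As a rough sketch of how a couple of these anomalies could be flagged automatically once the Solr record is visible: the field names `isbn` and `subject`, the helper names, and the sample document below are assumptions for illustration; the real Solr schema may differ. The ISBN check only tests the shape of the value (10 or 13 characters), not the checksum.

```python
import re

# Subjects listed above as "fake"; taken from this comment, not from any official list.
FAKE_SUBJECTS = {
    "In library",
    "Popular Print Disabled Books",
    "Large type books",
    "Protected DAISY",
    "Internet Archive Wishlist",
}


def looks_like_isbn(value: str) -> bool:
    """Return True if value has the shape of an ISBN-10 or ISBN-13."""
    compact = value.replace("-", "").replace(" ", "")
    return bool(re.fullmatch(r"\d{9}[\dXx]|\d{13}", compact))


def flag_anomalies(doc: dict) -> list[str]:
    """Collect human-readable problems found in a (hypothetical) Solr doc."""
    problems = []
    for isbn in doc.get("isbn", []):
        if not looks_like_isbn(isbn):
            problems.append(f"bad ISBN: {isbn!r}")
    for subject in doc.get("subject", []):
        if subject in FAKE_SUBJECTS or subject.startswith("Translations "):
            problems.append(f"fake subject: {subject!r}")
    return problems


sample = {
    "isbn": ["$2.30", "0140449066"],
    "subject": ["In library", "Voyages around the world"],
}
for problem in flag_anomalies(sample):
    print(problem)
```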

tfmorris commented 4 years ago

This request seems to be based on the premise that the system is, and will remain, broken. The answer to "What gets sent to Solr?" is "everything that should be." If that's ever not the case, then it's a bug which needs to be fixed.

Also, editing a single object/record results in potentially many Solr documents being updated, at some point in the future, so there's not a 1:1 correspondence, and it's not immediate.

LeadSongDog commented 4 years ago

@tfmorris No, just the reverse premise: the only way we'll ever know that it is fixed is to be able to see what it is doing. How, without seeing what is sent to the index, could we ever say "everything that should be, is"? Conversely, how could we know that search functions as expected without first knowing what is indexed?

If there is another way to gain the necessary transparency/visibility, sure, let's consider it.

BrittanyBunk commented 4 years ago

I've been seeing a lot of these bot mistakes too. I think it's just a matter of having a set format and telling the bots to convert whatever they have that is close to it into the website's standard format.

I see the difficulty in not being able to see the bots themselves, like https://openlibrary.org/people/ImportBot. However, I don't think you really need to look at their code in order to fix what they're doing. Wouldn't just creating another bot that corrects grammar in already-made records work better?

LeadSongDog commented 4 years ago

@BrittanyBunk I'm not suggesting we need users to be able or willing to examine the bots, but the data that is cached for a given work, edition, or author record should in some fashion be made available so that it is possible for us to determine whether it correctly reflects what is displayed in the user interface and vice versa.
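To make that comparison concrete, here is a small sketch, assuming both the displayed record and the indexed Solr document are available as plain dictionaries. The function name `diff_fields` and the field names in the sample data are hypothetical.

```python
def diff_fields(record: dict, solr_doc: dict, fields: list[str]) -> dict:
    """Return {field: (record_value, solr_value)} for fields that disagree."""
    return {
        field: (record.get(field), solr_doc.get(field))
        for field in fields
        if record.get(field) != solr_doc.get(field)
    }


# Made-up sample data: what the UI shows versus what the index holds.
displayed = {"title": "Le tour du monde en quatre-vingts jours", "publisher": "Brilliance Audio"}
indexed = {"title": "Le tour du monde en quatre-vingts jours", "publisher": "Brilliance Audio on CD"}

mismatches = diff_fields(displayed, indexed, ["title", "publisher"])
print(mismatches)
```

An empty result would mean the index agrees with the displayed record for the fields checked; anything else points at either an indexing bug or stale data.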

BrittanyBunk commented 4 years ago

@LeadSongDog No, I was the one suggesting that, by stating the options we have. Good thought. I will add to it: while people don't need to see the lines of code a bot works by, it would be good to have a list of what each bot does, a way for people to say whether it's doing its job correctly, and a way to say what further bots we need (if the current ones aren't enough), so that we can have better bots.