internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Design librarian process & UI for merging duplicate Subjects #65

Open george08 opened 13 years ago

george08 commented 13 years ago

Can we please get rid of the duplicate variations on search results pages for subjects?

E.g. http://openlibrary.org/search/subjects?q=New%20York

⚠️ EDIT: Administrative edit by @mekarpeles -- please jump to https://github.com/internetarchive/openlibrary/issues/65#issuecomment-1712067036 to see recent context + proposal for implementing a solution to this issue.

bencomp commented 10 years ago

Duplicate variations are cool. That there are so many of them is a data problem, not a search problem per se. Different spellings may make a huge difference and so do Paris (place) vs. Paris (person).

A serious problem I face here is getting 0 results when I click the New York example. Somehow OL doesn't understand the space, whether it is kept as %20 or translated to +. Only when I explicitly write new_york do I get results.
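For reference, the three spellings of the query can be reproduced with Python's standard library. This only illustrates the encodings; `to_subject_key` is a hypothetical helper, and the lowercase, underscore-separated form is an assumption based on the new_york example, not a description of OL's actual code.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "New York"

percent_form = quote(query)    # space kept as %20
plus_form = quote_plus(query)  # space translated to +


def to_subject_key(text: str) -> str:
    """Hypothetical helper: decode either encoding, then lowercase and
    join the words with underscores (the form that did return results)."""
    return "_".join(unquote_plus(text).lower().split())
```

Both encoded forms decode to the same string, so a search handler that normalises after decoding would treat them identically.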

bencomp commented 10 years ago

It appears my problem was solved by https://github.com/internetarchive/openlibrary/commit/06b0da28c08257a885e6535865c49098bcdd2852, thanks Anand!

I now realise the problem with duplicates/variations is a bit different than I thought. The search query for New York yields subjects like New York University. On page 1 are three listings for New York University:

The URLs for these results are the same, which means the number of books is wrong in at least two of these three results. The 'org' results are probably books in which the original records have the subject in a special (sub)field, but there is no way normal users can edit this in individual books.

I'm pretty sure this is a problem with how Solr gets updated.

hornc commented 7 years ago

In https://openlibrary.org/search/subjects?q=University+York the subject and org versions with the same URLs show the same counts...

but https://openlibrary.org/search/subjects?q=University+London has

London University College 40 books, subject
London University College 39 books, org

and the same URLs, so this is still an issue.

SOLR updating issue?

brad2014 commented 5 years ago

@hornc, could you please opine on the appropriate priority of this by setting a label?

tfmorris commented 5 years ago

George's query returns 0 results (problem solved? :) ) Likely related to #322

The general problem is that subjects need to be objects, not strings, and have aliases, associated metadata, etc.
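A minimal sketch of what "subject as an object" could look like, purely for illustration. The class name, field names, and the example key below are all assumptions and do not reflect any existing Open Library schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subject:
    """Hypothetical subject record: one canonical name plus known variants."""

    key: str                      # assumed identifier, not a real OL key
    name: str                     # canonical display name
    subject_type: str = "subject"
    aliases: set[str] = field(default_factory=set)

    def matches(self, text: str) -> bool:
        """True if text equals the name or any alias, ignoring case."""
        candidates = {self.name, *self.aliases}
        return text.casefold() in {c.casefold() for c in candidates}


york = Subject(
    key="example-subject-1",
    name="York University (Toronto, Ont.)",
    subject_type="org",
    aliases={"York University, Toronto", "Toronto York University"},
)
```

With a structure like this, the variant strings found on individual books would resolve to a single record instead of appearing as separate search hits.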

hornc commented 5 years ago

Yes, I think this can be closed. #322 is a more specific and current issue that can be dealt with independently. Any other current Subject related issues should be raised separately with current examples.

tfmorris commented 5 years ago

I think this is still a valid issue. Although George's original query is broken due to a different bug (#322), the problem she reported isn't fixed. The queries that @hornc posted here in 2017: https://github.com/internetarchive/openlibrary/issues/65#issuecomment-340944840 still demonstrate the problem.

The first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

resolve to only 2 unique URLs, which differ only by a single trailing period. This is precisely the same as Ben's example from 2014 (except there the difference was a single letter-case difference).

Additionally, the two subject URLs have work counts of 29 and 19, not 40 and 24, respectively, so the counts reported on the search results page are incorrect. I'm guessing that if we fix the duplication, the counts will take care of themselves, so why don't we focus this issue on that? It's also the problem that George reported in 2011 and was confirmed in 2014 and again in 2017.

There are also additional search hits which should be merged:

Toronto York University 4 books, subject
York University (Toronto) 1 book, org
York University, Toronto 1 book, subject

but if we take care of the simple, common cases to start, we'll have a 90% solution.
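The simple, common cases (a trailing period, a letter-case difference) could be clustered with a normalisation key along these lines. This is a sketch under my own assumptions: `normalize_subject` and `cluster_subjects` are hypothetical names, and the rules shown are not OL's actual import normalisation.

```python
import re
from collections import defaultdict


def normalize_subject(name: str) -> str:
    """Build a comparison key: collapse whitespace, drop trailing
    periods and spaces, and ignore letter case."""
    collapsed = re.sub(r"\s+", " ", name).strip()
    return collapsed.rstrip(". ").casefold()


def cluster_subjects(names):
    """Group subject strings that share the same normalised key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize_subject(name)].append(name)
    return dict(clusters)


hits = [
    "York University (Toronto, Ont.)",
    "York University (Toronto, Ont.).",
    "New York University",
    "New York university",
    "Toronto York University",
]
clusters = cluster_subjects(hits)
```

Note that "Toronto York University" still ends up in its own cluster: reordered or reworded variants need alias data or manual merging rather than string normalisation.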

On a more general note, I think favoring "current" (i.e. more modern) reports over historical ones is a bad idea in general because it loses the history of research that was done and obscures how long the problem has existed (8 years in this case).

mekarpeles commented 5 years ago

related #188

hornc commented 5 years ago

I'm trying to figure out what the remaining issue is here -- it seems the latest example is https://openlibrary.org/search/subjects?q=University+York

and in the first three hits:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

The issue is why there are two separate rows for:

York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject

A book that has this subject https://openlibrary.org/works/OL19055710W.json shows there is only one entry for York University (Toronto, Ont.). under subjects

Why is it showing in subject search results as an org and a subject?

Investigation task:

Subject search results page is: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/templates/search/subjects.html

Alternatively, skip investigation, and fix display:

At first glance it looks like if $key or $n have been seen before, they shouldn't be displayed again. Ideally the most specific category should be used over subject, so that the results are:

York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
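The display-side fix could be sketched in Python roughly as follows. The row fields (`key`, `name`, `count`, `type`) and the key values are assumptions made for this example; I have not checked them against what the template actually receives.

```python
def dedupe_rows(rows):
    """Keep one row per key, preferring a specific type (org, place, ...)
    over plain 'subject'. Rows keep the position of their first occurrence."""
    best = {}
    for row in rows:
        seen = best.get(row["key"])
        if seen is None:
            best[row["key"]] = row
        elif seen["type"] == "subject" and row["type"] != "subject":
            # Same key seen before as a plain subject: use the specific type.
            best[row["key"]] = row
    return list(best.values())


# Hypothetical rows mirroring the three hits above; keys are placeholders.
rows = [
    {"key": "k1", "name": "York University (Toronto, Ont.)", "count": 40, "type": "subject"},
    {"key": "k2", "name": "York University (Toronto, Ont.).", "count": 24, "type": "org"},
    {"key": "k2", "name": "York University (Toronto, Ont.).", "count": 24, "type": "subject"},
]
shown = dedupe_rows(rows)
```

This only hides the repeated row; it does not explain why Solr returns the same key under two types in the first place.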

The difference in the trailing period is a separate data issue, and subjects should go through some normalisation on import. (I believe they do on the import API path).

https://github.com/internetarchive/openlibrary/blob/a53f9018ed388449ba0c998a1880a37f5dafcbe8/openlibrary/templates/search/subjects.html#L44

tfmorris commented 5 years ago

I believe it was specifically the normalization/clustering of subjects that George's original issue was about.

The org/place/time subjects are used as search facets. If you think there's an issue there, we should create a separate issue to address it and not conflate it with the original problem.

mekarpeles commented 4 years ago

I think this issue needs to be renamed and needs a clear scope (how do we know it's complete) if it's going to stay open. @tfmorris would you like to suggest a more useful title and description for this issue? I don't understand the problem well enough other than... There is a discrepancy between subject search results for subject v. org? Is there any proposed solution? Do we know where to look / what code is doing something wrong?

mekarpeles commented 1 year ago

We should be able to make significant progress on this through #7486 of #2819.

Extending the ILE admin blue bar should allow librarians to search for works by subject and update them in bulk.

It's possible we'll also want to use scripts for duplicate subjects applied to thousands+ of books.
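The core of such a script might look something like the sketch below. It assumes each work is a dict with a `subjects` list of strings, as in the works JSON; `merge_subject` is a hypothetical name, and fetching and saving the records is deliberately left out.

```python
def merge_subject(works, duplicates, canonical):
    """Replace every duplicate subject string with the canonical one.

    Mutates the given work dicts in place and returns the ones that changed.
    """
    duplicates = set(duplicates)
    changed = []
    for work in works:
        subjects = work.get("subjects", [])
        merged = []
        for subject in subjects:
            target = canonical if subject in duplicates else subject
            if target not in merged:  # avoid listing the same subject twice
                merged.append(target)
        if merged != subjects:
            work["subjects"] = merged
            changed.append(work)
    return changed


# Placeholder records for illustration only.
works = [
    {"key": "w1", "subjects": ["York University (Toronto, Ont.).", "History"]},
    {"key": "w2", "subjects": ["History"]},
    {"key": "w3", "subjects": ["York University (Toronto, Ont.)", "York University (Toronto, Ont.)."]},
]
changed = merge_subject(
    works,
    duplicates=["York University (Toronto, Ont.)."],
    canonical="York University (Toronto, Ont.)",
)
```

Returning only the changed works keeps the number of writes down, which matters when a duplicate is applied to thousands of books.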

mekarpeles commented 1 year ago

Updates

The good news is, we now have a subject "object" which we're calling a Tag. See: #7928

We decided to make a new object called a Tag (and leave subjects as is) because there are many things we want a Tag object for that are not limited to "subject".

The current strategy is to use subject strings as a mechanism to pull a corresponding Tag object from the db. This process and the corresponding features and definitions are described in the breakdown section here.

Challenges

Even with a way to promote subject strings into Tag objects, there are still several challenges around duplication and naming. There are two things I'd like to briefly discuss:

  1. Some human or librarian-editable mechanism (e.g. a UI) should exist that allows us to merge subjects or tags (and have the changes apply to all relevant Works, etc.)
  2. Leaning into the use of ":" prefixes as a mechanism for "typing" subjects is another thing we've talked about, along with designing Tags so they have a type (like subject, place, content-warning, etc.).
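To make the second point concrete, here is a rough sketch of reading a type from a prefixed subject string. The `type:name` format, the `KNOWN_TYPES` set, and the function name are assumptions for illustration; only subject, place, and content-warning come from the list above.

```python
# Assumed set of types; only subject, place and content-warning are
# mentioned above, the rest are placeholders.
KNOWN_TYPES = {"subject", "place", "person", "time", "org", "content-warning"}


def split_typed_subject(raw: str):
    """Return (type, name) for a string such as 'place:New York'.

    Strings without a recognised prefix fall back to the 'subject' type.
    """
    prefix, sep, rest = raw.partition(":")
    prefix = prefix.strip().lower()
    if sep and prefix in KNOWN_TYPES and rest.strip():
        return prefix, rest.strip()
    return "subject", raw.strip()
```

Checking the prefix against a known set means a title-like subject that merely contains a colon is left as a plain subject.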

Proposed Solution

As a result of the progress we've made this year, I'm going to rename this issue to make it something more actionable:

"Design process & UI for merging duplicate Subjects"

Ideally, the solution would also leverage and extend our existing Librarian Merge Queue: https://openlibrary.org/merges to include merge requests for a list of subjects.