Open george08 opened 13 years ago
Duplicate variations are cool. That there are so many of them is a data problem, not a search problem per se. Different spellings may make a huge difference and so do Paris (place) vs. Paris (person).
A serious problem I face here is getting 0 results when I click the New York example. Somehow OL doesn't understand the space, neither kept as %20 nor translated to +. Only when I explicitly write new_york do I get results.
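For anyone reproducing this, the two encodings of the space mentioned above can be generated with Python's standard library. This is only a sketch of what a client might send and how a tolerant decoder would treat both forms; it says nothing about how Open Library actually parses the query.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "New York"

# Two common encodings of the same query-string value.
percent_encoded = quote(query)      # space becomes %20
plus_encoded = quote_plus(query)    # space becomes +

# unquote_plus accepts both forms and maps them back to the same text.
decoded_percent = unquote_plus(percent_encoded)
decoded_plus = unquote_plus(plus_encoded)

print(percent_encoded, plus_encoded, decoded_percent == decoded_plus)
```

If both forms decode to the same string, a search that works for one should in principle work for the other.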
It appears my problem was solved by https://github.com/internetarchive/openlibrary/commit/06b0da28c08257a885e6535865c49098bcdd2852, thanks Anand!
I now realise the problem with duplicates/variations is a bit different than I thought. The search query for New York yields subjects like New York University. On page 1 are three listings for New York University:
The URLs for these results are the same, which means the number of books is wrong in at least two of these three results. The 'org' results are probably books in which the original records have the subject in a special (sub)field, but there is no way normal users can edit this in individual books.
I'm pretty sure this is a problem with how Solr gets updated.
In https://openlibrary.org/search/subjects?q=University+York the subject and org versions with the same URLs show the same counts...
but https://openlibrary.org/search/subjects?q=University+London has
London University College 40 books, subject
London University College 39 books, org
and the same URLs, so this is still an issue.
SOLR updating issue?
@hornc , could you please opine as to the appropriate priority of this, by setting a label?
George's query returns 0 results (problem solved? :) ) Likely related to #322
The general problem is that subjects need to be objects, not strings, and have aliases, associated metadata, etc.
Yes, I think this can be closed. #322 is a more specific and current issue that can be dealt with independently. Any other current Subject related issues should be raised separately with current examples.
I think this is still a valid issue. Although George's original query is broken due to a different bug (#322), the problem she reported isn't fixed. The queries that @hornc posted here in 2017: https://github.com/internetarchive/openlibrary/issues/65#issuecomment-340944840 still demonstrate the problem.
The first three hits:
York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject
resolve to only 2 unique URLs which differ only by a single trailing period. This is precisely the same as Ben's example from 2014 (except there the difference was a single letter case difference).
Additionally, the two subject URLs have work counts of 29 and 19, not 40 and 24, respectively, so the counts reported on the search results page are incorrect, but I'm guessing that if we fix the duplication, the counts will take care of themselves, so why don't we focus this issue on that? It's also the problem that George reported in 2011 and was confirmed in 2014 and again in 2017.
There are also additional search hits which should be merged:
Toronto York University 4 books, subject
York University (Toronto) 1 book, org
York University, Toronto 1 book, subject
but if we take care of the simple, common cases to start, we'll have a 90% solution.
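To make the "simple, common cases" concrete, here is a minimal sketch of clustering subject strings that differ only by case, whitespace, or a trailing period. The function names and the normalisation rules are assumptions for illustration, not existing Open Library code.

```python
from collections import defaultdict


def normalize_subject(name: str) -> str:
    """Hypothetical clustering key: collapse whitespace, drop trailing periods, fold case."""
    collapsed = " ".join(name.split())
    return collapsed.rstrip(". ").casefold()


def cluster_subjects(names):
    """Group subject strings that share the same normalised key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize_subject(name)].append(name)
    return dict(clusters)


hits = [
    "York University (Toronto, Ont.)",
    "York University (Toronto, Ont.).",
    "York University, Toronto",
]
clusters = cluster_subjects(hits)
for key, variants in clusters.items():
    print(key, variants)
```

The first two strings fall into one cluster, while "York University, Toronto" stays separate. Merging that kind of variant would need more than string normalisation.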
On a more general note, I think favoring "current" (i.e. more modern) reports over historical ones is a bad idea in general because it loses the history of research that was done and obscures how long the problem has existed (8 years in this case).
related #188
I'm trying to figure out what the remaining issue is here -- it seems the latest example is https://openlibrary.org/search/subjects?q=University+York
and in the first three hits:
York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject
The issue is why there are two separate rows for:
York University (Toronto, Ont.). 24 books, org
York University (Toronto, Ont.). 24 books, subject
A book that has this subject, https://openlibrary.org/works/OL19055710W.json, shows there is only one entry for York University (Toronto, Ont.). under subjects.
Why is it showing in subject search results as an org and a subject?
Investigation task: determine why the same entry is listed as both org and subject on the subject results page.
Subject search results page is: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/templates/search/subjects.html
Alternatively, skip investigation, and fix display:
At first glance it looks like if $key or $n have been seen before, they shouldn't be displayed again.
Ideally the most specific category should be used over subject, so that the results are:
York University (Toronto, Ont.) 40 books, subject
York University (Toronto, Ont.). 24 books, org
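A rough sketch of that display fix, written in plain Python rather than in the template: keep one row per key, and let a more specific type replace a plain subject row. The dict fields (`key`, `name`, `count`, `type`) and the key values are hypothetical stand-ins, not the actual variables used in subjects.html.

```python
def dedupe_hits(hits):
    """Keep one row per key, preferring a specific type (e.g. org) over plain 'subject'."""
    best = {}  # dicts preserve the order in which keys were first seen
    for hit in hits:
        seen = best.get(hit["key"])
        if seen is None:
            best[hit["key"]] = hit
        elif seen["type"] == "subject" and hit["type"] != "subject":
            # A more specific category replaces the generic one in place.
            best[hit["key"]] = hit
    return list(best.values())


# Hypothetical keys: the two URLs differ only by a trailing period.
hits = [
    {"key": "york_university_(toronto_ont.)", "name": "York University (Toronto, Ont.)", "count": 40, "type": "subject"},
    {"key": "york_university_(toronto_ont.).", "name": "York University (Toronto, Ont.).", "count": 24, "type": "org"},
    {"key": "york_university_(toronto_ont.).", "name": "York University (Toronto, Ont.).", "count": 24, "type": "subject"},
]
result = dedupe_hits(hits)
for row in result:
    print(row["name"], row["count"], row["type"])
```

This only hides the repeated row; it does not address why the same key is indexed under two types in the first place.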
The difference in the trailing period is a separate data issue, and subjects should go through some normalisation on import. (I believe they do on the import API path).
I believe it was specifically the normalization/clustering of subjects that George's original issue was about.
The org/place/time subjects are used as search facets. If you think there's an issue there, we should create a separate issue to address it and not conflate it with the original problem.
I think this issue needs to be renamed and needs a clear scope (how do we know it's complete) if it's going to stay open. @tfmorris would you like to suggest a more useful title and description for this issue? I don't understand the problem well enough other than... There is a discrepancy between subject search results for subject v. org? Is there any proposed solution? Do we know where to look / what code is doing something wrong?
We should be able to make significant progress on this through #7486 of #2819.
Extending the ILE admin blue bar should allow librarians to search for works by subject and update them in bulk.
It's possible we'll also want to use scripts for duplicate subjects applied to thousands+ of books.
The good news is, we now have a subject "object" which we're calling a Tag. See: #7928
We decided to make a new object called a Tag (and leave subjects as is) because there are many things we want a Tag object for that are not limited to "subject".
The current strategy is to use subject strings as a mechanism to pull a corresponding Tag object from the db. This process and the corresponding features and definitions are described in the breakdown section here.
Even with a way to promote subject strings into Tag objects, there are still several challenges around duplication and naming. There are two things I'd like to briefly discuss:
":" prefixes as a mechanism of "typing" subjects is another thing we've talked about, and designing Tags so they have a type (like subject, place, content-warning, etc).
As a result of the progress we've made this year, I'm going to rename this issue to make it something more actionable:
"Design process & UI for merging duplicate Subjects"
Ideally, the solution would also leverage and extend our existing Librarian Merge Queue: https://openlibrary.org/merges to include merge requests for a list of subjects.
The Merge Queue would need a subject type and should also have a type filter and dropdown with checkboxes that allow librarians to select and see Work and/or Author and/or subject merge requests. Similar to how the URL may specify e.g. ?reviewer=librarian123, we should have types=authors,works,subjects (default) -- very similar change to #8272
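As a starting point for discussion, the proposed types parameter could be parsed roughly like this. The function name, the default handling, and the choice to silently ignore unknown types are all assumptions; the real merge queue code may be structured quite differently.

```python
from urllib.parse import parse_qs

# Proposed values from the comment above; all three are shown by default.
ALL_TYPES = ("authors", "works", "subjects")


def parse_merge_types(query_string: str):
    """Hypothetical helper: return the requested merge types in a stable order."""
    params = parse_qs(query_string.lstrip("?"))
    raw = params.get("types", [",".join(ALL_TYPES)])[0]
    requested = [t.strip() for t in raw.split(",") if t.strip()]
    # Unknown values are dropped rather than raising an error.
    return [t for t in ALL_TYPES if t in requested]


print(parse_merge_types("?reviewer=librarian123"))
print(parse_merge_types("?reviewer=librarian123&types=subjects,works"))
```

Returning the types in a fixed order keeps the checkbox state predictable regardless of how the URL was written.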
Can we please get rid of the duplicate variations on search results pages for subjects?
E.g. http://openlibrary.org/search/subjects?q=New%20York
⚠️ EDIT: Administrative edit by @mekarpeles -- please jump to https://github.com/internetarchive/openlibrary/issues/65#issuecomment-1712067036 to see recent context + proposal for implementing a solution to this issue.