internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Authors not findable using Search #699

Closed tfmorris closed 3 years ago

tfmorris commented 6 years ago

None of these authors are findable using search even though (many) author records exist for them.

I thought perhaps it was associated with long author names, but this one is findable even though it's longer than the Committee on the Pacific Railroad name: Princeton University. Dept. of Economics and Social Institutions. Industrial Relations Section.

Here's the list of author records for one of the unsearchable names:

/authors/OL4620383A United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation
/authors/OL4620614A
/authors/OL4625592A
/authors/OL4620175A
/authors/OL4620217A
/authors/OL4625064A
/authors/OL48266A
/authors/OL4626004A
/authors/OL4625755A
/authors/OL4625259A
/authors/OL4625904A
/authors/OL4625065A
/authors/OL4625754A
/authors/OL4620231A
/authors/OL4620213A
/authors/OL4623159A
/authors/OL4625899A

cdrini commented 6 years ago

Example: https://openlibrary.org/search/authors?q=United+States+Congress+House+Aviation has 0 results but should yield https://openlibrary.org/authors/OL4620383A (United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation)
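For anyone who wants to reproduce checks like this for other names, here is a minimal sketch that builds the same author search URL from a plain name string. The helper name `author_search_url` is hypothetical; only the URL pattern is taken from the example above.

```python
from urllib.parse import urlencode

# Base path taken from the search URL quoted in this thread.
AUTHOR_SEARCH_BASE = "https://openlibrary.org/search/authors"


def author_search_url(query: str) -> str:
    """Build an author search URL for `query` (hypothetical helper).

    urlencode() encodes spaces as '+' and leaves periods alone,
    which matches the URLs pasted in this issue.
    """
    return f"{AUTHOR_SEARCH_BASE}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(author_search_url("United States Congress House Aviation"))
```

Opening the printed URL in a browser shows whether the expected author record appears in the results.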

tfmorris commented 6 years ago

A shorter example is "New Hampshire. Council" which fails to return any of these records:

/authors/OL4896972A New Hampshire. Council      
/authors/OL4629754A New Hampshire. Council.     
/authors/OL4896987A New Hampshire. Council      
/authors/OL4896997A New Hampshire. Council      
/authors/OL4896999A New Hampshire. Council      
/authors/OL4896975A New Hampshire. Council      
/authors/OL4896989A New Hampshire. Council      
/authors/OL4896984A New Hampshire. Council      
/authors/OL4896980A New Hampshire. Council      
/authors/OL4896996A New Hampshire. Council      
/authors/OL2194235A New Hampshire. Council on Problems of the Aging.        
/authors/OL4896994A New Hampshire. Council      
/authors/OL4896982A New Hampshire. Council      
/authors/OL4896998A New Hampshire. Council      
/authors/OL4896979A New Hampshire. Council      
/authors/OL4897000A New Hampshire. Council      
/authors/OL4896991A New Hampshire. Council      
/authors/OL4896977A New Hampshire. Council      
/authors/OL4896990A New Hampshire. Council      
/authors/OL2374994A New Hampshire. Council on Postwar Planning and Rehabilitation.      
/authors/OL4896986A New Hampshire. Council

tfmorris commented 6 years ago

Not sure it's significant, but https://openlibrary.org/authors/OL4620383A.json doesn't have a created key, while https://openlibrary.org/authors/OL4943246A.json, which is searchable, does.

If the update code is depending on that to exist for some reason, it could be unhappy.

tfmorris commented 6 years ago

I think my note above about the created key was a red herring.

I was looking at the most prolific authors and have a few new record setters which don't show up in search. The first column is the number of works they've authored.

14131 /authors/OL2336667A United States. Congress. Senate. Committee on Pensions
7237 /authors/OL4789289A United States. Congress. Senate. Committee on Claims
6047 /authors/OL2375088A United States. Congress. House. Committee on Invalid Pensions.
5498 /authors/OL4766486A United States. Congress. House. Committee on Claims

tfmorris commented 6 years ago

Here's another batch. Except for the New Hampshire. Council author mentioned above, all others appear to be United States. Congress. entries of some flavor or another. The range of IDs indicates that they weren't all created at the same time.

As an aside, the number in parentheses is the number of works listed on the author's page. The entries with asterisks have counts which are off pretty dramatically.

5082 /authors/OL4521280A United States. Congress. Senate. Committee on Commerce (4560)
5066 /authors/OL2323345A United States. Congress. House. Committee on War Claims.
4894 /authors/OL4523254A United States. Congress. House. Committee on the Judiciary (3643)
4839 /authors/OL184870A United States. Congress. House. Committee on Military Affairs. (4813)
3531 /authors/OL4521525A United States. Congress. House. Committee on Interstate and Foreign Commerce (3044) *
3407 /authors/OL4521082A United States. Congress. Senate. Committee on the Judiciary (2906)
3326 /authors/OL4774429A United States. Congress. Senate. Committee on Military Affairs (3050)
3230 /authors/OL47374A United States. Congress. House. Committee on Rules. (1517) *
2941 /authors/OL4648820A United States. Congress. House. Committee on Naval Affairs (2577)
2915 /authors/OL4835960A United States. Congress. House. Committee on Rivers and Harbors (699)
2861 /authors/OL4521469A United States. Congress. Senate. Committee on Foreign Relations (2271)
2761 /authors/OL43204A United States. Congress. Senate. Committee on Energy and Natural Resources. (1889)
2699 /authors/OL4527173A United States. Congress. House. Committee on Ways and Means (1843)
2545 /authors/OL4522330A United States. Congress. Senate. Committee on Finance (2292)
2436 /authors/OL4528217A United States. Congress. House. Committee on Foreign Affairs (1822)
2268 /authors/OL4521848A United States. Congress. Senate. Committee on Appropriations (1746)
2086 /authors/OL4657773A United States. Congress. Senate. Committee on the District of Columbia (1675)
2008 /authors/OL159513A United States. Congress. House. Committee on Public Lands (1836) *
1004 /authors/OL4839666A United States. Congress. Senate. Committee on Public Lands and Surveys (928)
100 /authors/OL868250A United States. Congress. House. Committee on the Judiciary. Subcommittee on Monopolies and Commercial Law. (90)
90 /authors/OL988950A United States. Congress. House. Committee on Science and Technology. Subcommittee on Natural Resources, Agriculture Research, and Environment. (82)
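A rough sketch of how the two counts in each entry could be compared mechanically, assuming one entry per line in the form `works key name (page count)`. The function names and the 25% tolerance are assumptions made for illustration and are not the rule used to place the asterisks above.

```python
import re

# One entry: "<works> /authors/OL...A <name> (<page count>) *"
# The parenthesized page count and the trailing asterisk are optional.
ENTRY_RE = re.compile(
    r"^(\d+)\s+(/authors/OL\d+A)\s+(.*?)(?:\s+\((\d+)\))?\s*\*?\s*$"
)


def parse_entry(line):
    """Parse one entry line into a dict, or return None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    works, key, name, page_count = match.groups()
    return {
        "works": int(works),
        "key": key,
        "name": name,
        "page_count": int(page_count) if page_count else None,
    }


def counts_disagree(entry, tolerance=0.25):
    """True when the two counts differ by more than `tolerance` (relative).

    The 25% default is an arbitrary, hypothetical threshold.
    """
    if entry["page_count"] is None or entry["works"] == 0:
        return False
    gap = abs(entry["works"] - entry["page_count"])
    return gap / entry["works"] > tolerance


if __name__ == "__main__":
    sample = "3230 /authors/OL47374A United States. Congress. House. Committee on Rules. (1517) *"
    entry = parse_entry(sample)
    print(entry["key"], counts_disagree(entry))
```

Entries without a parenthesized count are parsed with `page_count` set to `None` and are never flagged.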

tfmorris commented 6 years ago

A couple more and a new theory:

789 "/authors/OL24127A" Metropolitan Museum of Art (New York, N.Y.) (408)
374 "/authors/OL4480A" India. Parliament. Committee on Public Undertakings. (125)

Perhaps two or more periods in the name is what causes the problem? Or non-terminal periods?

On the other hand, there's a duplicate Metropolitan Museum of Art entry with the exact same name which did get indexed correctly:

Of course the author which can't be found has 408 works associated with it, while the correctly indexed author has none. :-(

LeadSongDog commented 6 years ago

https://openlibrary.org/search/authors?q=Metropolitan+Museum+of+Art finds the merged author after I made this edit: https://openlibrary.org/authors/OL24127A/Metropolitan_Museum_of_Art_(New_York_N.Y.)?b=4&a=3&_compare=Compare&m=diff

One might suspect the two are somehow related. There were earlier issues related to searches when the stopword "New" was part of the query.

mekarpeles commented 6 years ago

I remember seeing something similar and thinking "New" was a problematic keyword. There's an issue about it, but I don't recall whether it was a related issue or whether "New" was actually the problem. I'll look into it!

mekarpeles commented 6 years ago

@LeadSongDog re: Author search, please see #699. Somewhat embarrassed to say, I'm not sure re-indexing is occurring at all in several such cases. https://github.com/internetarchive/openlibrary/issues/351#issuecomment-356123527

mekarpeles commented 6 years ago

I may need to cc: @gdamdam to make sure I kick off this solr-updater process correctly. I believe he has internal docs on this process, which I should try to document more publicly.

mekarpeles commented 6 years ago

related: #714

hornc commented 6 years ago

https://openlibrary.org/search/authors?q=New+Hampshire.+Council gives 2 results @ 7:40 UTC: https://openlibrary.org/authors/OL7359992A and https://openlibrary.org/authors/OL7406663A

As an experiment I am going to add https://openlibrary.org/authors/OL4896977A/New_Hampshire._Council to the manual admin/solr interface @ 7:40 UTC

and check the search results sometime later. EDIT: search results had not changed within 7mins, but OL4896977A was in the search results at 9:15 UTC (the next time I checked, I'm sure it was added a lot sooner than that). This shows that these authors can be added to the index. Normally this will occur on any edit to the record.

Authors are added by the solr updater if they appear in the infogami edit logs, which means whenever any of the record's data changes. The admin/solr interface allows admins to add a record into that same update pipeline. I expect OL4896977A will show up in search results within 15mins.

I think we need a way to identify and re-index items that have, for whatever reason, missed indexing in the past. There may be a way to do targeted partial re-indexes if we can identify the targets.
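As a starting point for identifying targets, here is a minimal sketch that compares a list of known author keys against what search returns for the same name. It assumes the author search page has a JSON variant at `/search/authors.json` whose response contains a `docs` list with a `key` per author; that response shape and all function names here are assumptions, so the search function is injectable and can be swapped out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def normalize_key(key):
    """Reduce '/authors/OL123A' or 'OL123A' to the bare ID 'OL123A'."""
    return key.rsplit("/", 1)[-1]


def search_author_keys(name):
    """Return author keys that search reports for `name` (network call).

    Assumes a JSON endpoint and a {'docs': [{'key': ...}]} response shape.
    """
    url = "https://openlibrary.org/search/authors.json?" + urlencode({"q": name})
    with urlopen(url) as response:
        docs = json.load(response).get("docs", [])
    return {doc["key"] for doc in docs if "key" in doc}


def find_unindexed(expected_keys, name, search=search_author_keys):
    """Return the expected keys that do not appear in search results for `name`."""
    found = {normalize_key(key) for key in search(name)}
    return [key for key in expected_keys if normalize_key(key) not in found]


# Example (performs a real request, so it is left commented out):
# missing = find_unindexed(
#     ["/authors/OL4896977A", "/authors/OL4896972A"], "New Hampshire. Council"
# )
```

The returned list could then be fed into whatever re-index mechanism is chosen. Note that a search endpoint may page its results, so a name with many matches could need more than one request; this sketch does not handle that.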

hornc commented 6 years ago

The one thing I notice these authors have in common is that they were all initially imported in 2008, which is the earliest year OL records were added, and before a lot of the processes were finalised.

https://openlibrary.org/authors/OL4528217A/ was created in 2008, but last edited in 2012, which by my theory above, should have been indexed. It's not in search results https://openlibrary.org/search/authors?q=United+States.+Congress.+House.+Committee+on+Foreign+Affairs @ 9:32 UTC (when I made the edit)

I am making an edit to the record now to see if it gets added to the index soon. EDIT: OL4528217A showed up in search results at 9:47 UTC.
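The manual "edit, then keep checking search" experiment could be scripted along these lines. This is only a sketch: `wait_until_indexed` is a hypothetical helper, `search` is any caller-supplied function that returns the set of author IDs currently found for a name, and the 60 second interval and 15 minute timeout are assumed defaults.

```python
import time


def wait_until_indexed(key, name, search, interval=60.0, timeout=900.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `search(name)` until `key` appears in the returned set.

    Returns the elapsed seconds when the key is found, or None on timeout.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if key in search(name):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


if __name__ == "__main__":
    # Offline demo with a fake search that "indexes" the key on the third poll.
    polls = {"count": 0}

    def fake_search(name):
        polls["count"] += 1
        return {"OL4528217A"} if polls["count"] >= 3 else set()

    elapsed = wait_until_indexed("OL4528217A", "demo", fake_search,
                                 interval=0.01, timeout=1.0)
    print("indexed" if elapsed is not None else "timed out")
```

Recording the elapsed time for a handful of records would give a rough picture of how long the updater takes after an edit.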

tfmorris commented 6 years ago

Good to know that these records aren't fundamentally broken in some way and can be indexed if we can identify them.

Implicit in the results of this experiment is that the search index probably hasn't been rebuilt since 2008, which is kind of a frightening thought. Who knows how many holes and errors are in it...

tfmorris commented 5 years ago

A spot check shows that these are successfully indexed in my dev Solr instance. For example, "New Hampshire. Council" returns all 21 author records listed above, and "United States. Congress. Senate. Committee on Pensions" is findable as well. The issues with the work_count also appear to be fixed in the new index.

xayhewalo commented 4 years ago

Another issue that will be affected/fixed by #2246

mekarpeles commented 4 years ago

I think we could use a top-level issue which more surgically outlines and enumerates things which are not indexed by search (there are plenty of works as well which exist and don't seem to be indexed)

cdrini commented 3 years ago

I think @hornc is correct: solr-updater was likely broken or down at the time these authors were created, and they were never re-indexed. All the examples here now work, because we've done a few full re-indexes over the last year.