internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Newly imported authors don't get indexed properly for search #2275

Open tfmorris opened 5 years ago

tfmorris commented 5 years ago

Description

It's relatively common to come across OpenLibrary search listings that show an author with 0 works, even though the author record lists at least one work when you click through to it.

It occurred to me today that this is probably because the author entries are built by searching the Solr index, which may not have been populated yet if the work and the author are added at the same time (as is likely with a new author's first book).
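To illustrate the suspected race, here is a minimal sketch using a toy in-memory index with explicit commits. It is not Open Library or Solr code: `ToyIndex`, `build_author_doc`, and the record keys are hypothetical names invented for this example. It only assumes, as described above, that the author's work count is derived by querying the index.

```python
# Toy in-memory stand-in for a search index with explicit commits.
# Nothing here is Open Library or Solr code; all names are hypothetical.

class ToyIndex:
    def __init__(self):
        self.committed = []  # documents visible to searches
        self.pending = []    # documents added but not yet committed

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def count_works_by(self, author_key):
        # Only committed documents are searchable.
        return sum(
            1
            for d in self.committed
            if d["type"] == "work" and author_key in d["author_keys"]
        )


def build_author_doc(index, author_key, name):
    # The author's work_count is derived by querying the index.
    return {
        "type": "author",
        "key": author_key,
        "name": name,
        "work_count": index.count_works_by(author_key),
    }


index = ToyIndex()
work = {"type": "work", "key": "/works/OL1W", "author_keys": ["/authors/OL1A"]}

# Work and author go into the same batch: the work is not searchable yet,
# so the author document is built with a count of zero.
index.add(work)
stale_author = build_author_doc(index, "/authors/OL1A", "New Author")
index.add(stale_author)
index.commit()

# Rebuilding the author document after the commit sees the work.
fresh_author = build_author_doc(index, "/authors/OL1A", "New Author")
```

In this toy model the author document built in the same batch as the work gets a count of zero, while one built after the work has been committed gets the right count.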

Relevant url?

https://openlibrary.org/search/authors.json?q=Brandon+Hobson&mode=everything
http://openlibrary.org/authors/OL7444218A.json

Expectation

All authors, including newly added authors, are indexed correctly.

Proposal

We can do one of two things:

  1. Create the new author doc for Solr directly using a placeholder template of:
    {"work_count": 1,
     "top_work": "<new work title>",
     "top_subjects": "<subjects from new work>"}
  2. Split the update into two and make sure that the update for the work is committed first.

I think I prefer the first approach.
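A rough sketch of what the first approach could look like, assuming the new author and work are available as plain dictionaries at import time. The function name `placeholder_author_doc` and the `key` and `name` fields are assumptions added for illustration; only `work_count`, `top_work`, and `top_subjects` come from the template above.

```python
def placeholder_author_doc(author, work):
    """Build a search document for a brand-new author from their first work.

    Hypothetical helper: it fills the fields directly instead of deriving
    them from a search that cannot see the new work yet.
    """
    return {
        "key": author["key"],    # assumed field, not in the template
        "name": author["name"],  # assumed field, not in the template
        "work_count": 1,
        "top_work": work["title"],
        "top_subjects": list(work.get("subjects", [])),
    }


# Example records with made-up keys and titles.
author = {"key": "/authors/OL1A", "name": "New Author"}
work = {"title": "First Book", "subjects": ["Fiction"]}
doc = placeholder_author_doc(author, work)
```

Because the placeholder never queries the index, it does not depend on whether the work has been committed yet. A later reindex of the author would presumably replace it with counts derived from the index.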

LeadSongDog commented 5 years ago

Yes, timing is a definite factor. After creating the new work and author records, further works cannot be linked to that author record until it has been indexed, which may take many minutes. A new contributor unfamiliar with the behavior will think that the author record was never created, often resulting in duplication.

LeadSongDog commented 5 years ago

Here's an older example. Two records that ImportBot created within 7 seconds back in 2008 for an author who already had a record:

    {"name": "Grieg Marshall Spankie", "last_modified": {"type": "/type/datetime", "value": "2008-04-30 09:38:13.731961"}, "key": "/authors/OL3374795A", "type": {"key": "/type/author"}, "id": 13517871, "revision": 1}
    {"name": "Grieg M. Spankie", "personal_name": "Grieg M. Spankie", "last_modified": {"type": "/type/datetime", "value": "2008-09-16 18:45:08.660598"}, "key": "/authors/OL4754134A", "type": {"key": "/type/author"}, "id": 20497658, "revision": 1}
    {"name": "Grieg M. Spankie", "personal_name": "Grieg M. Spankie", "last_modified": {"type": "/type/datetime", "value": "2008-09-16 18:45:15.392316"}, "key": "/authors/OL4754143A", "type": {"key": "/type/author"}, "id": 20497692, "revision": 1}

xayhewalo commented 4 years ago

Making this a sub-task of #789. Assigning @cdrini per Slack discussions.