projectblacklight / spotlight

Spotlight enables librarians, curators, and others who are responsible for digital collections to create attractive, feature-rich websites that highlight these collections.

Fault in indexing of Spotlight::Resources::Upload objects #2989

Open mephillips-durham opened 7 months ago

mephillips-durham commented 7 months ago

Summary

During indexing of objects of type Spotlight::Resources::Upload (which can be created via either Add Items -> Upload item or Add Items -> Upload multiple items), the background job attempts to look up a record in spotlight_featured_images whose id equals the spotlight_resources.id of the object being indexed. I think it should instead use spotlight_resources.upload_id as the value to look for in spotlight_featured_images.id.

Impact

If there are fewer rows in spotlight_featured_images than in spotlight_resources, the lookup fails and the object is not indexed in Solr.

Steps to reproduce

In an empty Spotlight, with minimal configuration changes from the current latest engine, I created an exhibit and published it. I did not set a thumbnail or masthead image for the exhibit. I then used the Upload item form to upload a single JPEG, setting only the title field in the form. The record was successfully indexed and appears in the exhibit user interface and in Solr.

At this point, the following queries report 1 for each value:

select count(*), max(id) from spotlight_featured_images;
select count(*), max(id) from spotlight_resources;

The spotlight_resources and spotlight_solr_document_sidecars tables are as follows:

sqlite> select * from spotlight_resources ;
          id = 1
  exhibit_id = 1
        type = Spotlight::Resources::Upload
         url =
        data = --- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
full_title_tesim: Castle tour 8
spotlight_upload_description_tesim: ''
spotlight_upload_attribution_tesim: ''
spotlight_upload_date_tesim: ''

  indexed_at =
  created_at = 2023-11-20 11:50:10.316739
  updated_at = 2023-11-20 11:50:10.316739
    metadata =
index_status =
   upload_id = 1

sqlite> select * from spotlight_solr_document_sidecars;
           id = 1
   exhibit_id = 1
       public = 1
         data = ---
configured_fields: !ruby/hash:ActiveSupport::HashWithIndifferentAccess
  full_title_tesim: Castle tour 8

   created_at = 2023-11-20 11:50:10.416183
   updated_at = 2023-11-20 11:50:19.846637
  document_id = 1-1
document_type = SolrDocument
  resource_id = 1
resource_type =
 index_status =

Next, using the IIIF URL facility in Add Items, I imported the following IIIF URL for a single manifest: https://iiif.durham.ac.uk/manifests/trifle/32150/t1/m2/z1/t1m2z10wq47s/manifest

This imported the record and created a record in Solr. No row was created in spotlight_featured_images, so the COUNT(*) and MAX(id) were both still 1 for that table, and 2 for spotlight_resources. The spotlight_resources and spotlight_solr_document_sidecars records were as follows:

sqlite> select * from spotlight_resources where id=2;
          id = 2
  exhibit_id = 1
        type = Spotlight::Resources::IiifHarvester
         url = https://iiif.durham.ac.uk/manifests/trifle/32150/t1/m2/z1/t1m2z10wq47s/manifest
        data =
  indexed_at =
  created_at = 2023-11-20 12:05:01.184968
  updated_at = 2023-11-20 12:05:01.184968
    metadata =
index_status =
   upload_id =
sqlite> select * from spotlight_solr_document_sidecars where resource_id=2;
           id = 2
   exhibit_id = 1
       public = 1
         data = ---
readonly_published_tesim:
- '1901'
readonly_attribution_tesim:
- Durham University Library
readonly_description_tesim:
- 'Durham University Rugby Football XV 1900-1, in a photographer''s studio, in light
  coloured broad-hooped shirts, dark shorts, various socks, and boots, with some blazers
  (?full and half palatinates), caps, and scarves, by Grand Studio of Pilgrim St,
  Newcastle, identified: W. Saunderson (Science), L.A.H. Bulkeley (Medicine), W. Fleming
  (Medicine), H.W. Cousins (Science), W. Seymour (Medicine), H.F.D. Turner (University),
  W.R. Heath (Hatfield), J.C. Hill (Hatfield), S. Raw (Medicine), B.S. Robson (Capt)
  (Medicine), F.J. Gowans (Medicine), F.W. Kemp (Medicine), L. Smith (Hatfield), C.G.
  King (Hatfield), S. Cochrane (Medicine).'
readonly_license_tesim:
- http://creativecommons.org/licenses/by-nc-nd/4.0/legalcode

   created_at = 2023-11-20 12:05:03.491782
   updated_at = 2023-11-20 12:05:03.629979
  document_id = 27c59d671bf8cd8d07894cffe38d7712
document_type = SolrDocument
  resource_id = 2
resource_type =
 index_status =

Next I imported another JPEG image via the "Upload item" form. This time the background job indexing the record reported the following:

2023-11-20T12:08:49.824Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c INFO: start
2023-11-20T12:08:50.561Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c INFO: Performing Spotlight::ReindexJob (Job ID: 4f391c36-7276-42f5-affc-fba0d522205c) from Sidekiq(default) enqueued at 2023-11-20T12:08:49Z with arguments: #<GlobalID:0x00007fd74a4bbb48 @uri=#<URI::GID gid://dur-spotlight/Spotlight::Resources::Upload/3>>, {"validity_token"=>nil}
2023-11-20T12:08:50.843Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c ERROR: Caught exception Couldn't find Spotlight::FeaturedImage with 'id'=3
2023-11-20T12:08:50.922Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c INFO: Indexing item #<Spotlight::Resources::Upload id: 3, exhibit_i... in resource 3 (0 / 1) (207.0ms)
2023-11-20T12:08:51.159Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c INFO: Performed Spotlight::ReindexJob (Job ID: 4f391c36-7276-42f5-affc-fba0d522205c) from Sidekiq(default) in 611.7ms
2023-11-20T12:08:51.160Z pid=77819 tid=1wdn class=Spotlight::ReindexJob jid=37231af6eafc06e80b184d4c elapsed=1.336 INFO: done

The significant line is this:

Couldn't find Spotlight::FeaturedImage with 'id'=3

The record in spotlight_resources is as follows. There is no corresponding record in spotlight_solr_document_sidecars and no record in Solr.

sqlite> select * from spotlight_resources where id=3;
          id = 3
  exhibit_id = 1
        type = Spotlight::Resources::Upload
         url =
        data = --- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
full_title_tesim: Castle tour 19
spotlight_upload_description_tesim: ''
spotlight_upload_attribution_tesim: ''
spotlight_upload_date_tesim: ''

  indexed_at =
  created_at = 2023-11-20 12:08:49.657818
  updated_at = 2023-11-20 12:08:49.657818
    metadata =
index_status =
   upload_id = 2
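
The mismatch can be illustrated outside Spotlight with a minimal sketch in Python against an in-memory SQLite database. It is not Spotlight's code: the two tables are cut down to the columns relevant here, the helper names image_by_resource_id and image_by_upload_id are hypothetical, and it assumes that the second upload created a spotlight_featured_images row with id 2 (as upload_id = 2 suggests).

```python
import sqlite3

# Reduced copies of the two tables, holding the state reached after the
# third import in the steps above (assumption: featured image 2 exists).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spotlight_featured_images (id INTEGER PRIMARY KEY);
CREATE TABLE spotlight_resources (
    id        INTEGER PRIMARY KEY,
    type      TEXT,
    upload_id INTEGER
);
INSERT INTO spotlight_featured_images (id) VALUES (1), (2);
INSERT INTO spotlight_resources (id, type, upload_id) VALUES
    (1, 'Spotlight::Resources::Upload', 1),
    (2, 'Spotlight::Resources::IiifHarvester', NULL),
    (3, 'Spotlight::Resources::Upload', 2);
""")


def image_by_resource_id(conn, resource_id):
    """Lookup keyed on spotlight_resources.id (the behaviour the log suggests)."""
    row = conn.execute(
        "SELECT id FROM spotlight_featured_images WHERE id = ?",
        (resource_id,),
    ).fetchone()
    return row[0] if row else None


def image_by_upload_id(conn, resource_id):
    """Lookup keyed on spotlight_resources.upload_id (the proposed behaviour)."""
    row = conn.execute(
        """
        SELECT fi.id
        FROM spotlight_resources AS r
        JOIN spotlight_featured_images AS fi ON fi.id = r.upload_id
        WHERE r.id = ?
        """,
        (resource_id,),
    ).fetchone()
    return row[0] if row else None


print(image_by_resource_id(conn, 3))  # None: no featured image has id 3
print(image_by_upload_id(conn, 3))    # 2: the image referenced by upload_id
```

For resource 1 both lookups return 1, which would explain why the first upload indexed without error.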

If I create a blank row in spotlight_featured_images using insert into spotlight_featured_images (id) values (3); I can then reindex the item, and uploading further items also succeeds. If I do not apply this workaround, using "Upload Item" to upload static images (or uploading a CSV file) continues to fail, but importing more IIIF items works fine.

Note that if there are more spotlight_featured_images rows than spotlight_resources rows at the time of uploading another JPEG, the problem does not arise, and the correct images are shown against each item in the results. This suggests that the code for searching and displaying results is correctly accessing the corresponding featured image, and only the indexing code is at fault.
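
To see which existing uploads are exposed to this, a query along the following lines may help. This is a sketch only, assuming the two-table layout shown in this report; build_demo_db and uploads_without_same_id_image are hypothetical helper names, and the demo data stands in for a real Spotlight database.

```python
import sqlite3

# Upload resources for which no featured image shares the resource's own id,
# i.e. the rows an id-based lookup would fail on during indexing.
AT_RISK_QUERY = """
SELECT r.id, r.upload_id
FROM spotlight_resources AS r
LEFT JOIN spotlight_featured_images AS fi ON fi.id = r.id
WHERE r.type = 'Spotlight::Resources::Upload'
  AND fi.id IS NULL
ORDER BY r.id
"""


def uploads_without_same_id_image(conn):
    """Return (resource id, upload_id) pairs with no same-id featured image."""
    return conn.execute(AT_RISK_QUERY).fetchall()


def build_demo_db():
    """Build an in-memory database mirroring the state described above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE spotlight_featured_images (id INTEGER PRIMARY KEY);
    CREATE TABLE spotlight_resources (
        id        INTEGER PRIMARY KEY,
        type      TEXT,
        upload_id INTEGER
    );
    INSERT INTO spotlight_featured_images (id) VALUES (1), (2);
    INSERT INTO spotlight_resources (id, type, upload_id) VALUES
        (1, 'Spotlight::Resources::Upload', 1),
        (2, 'Spotlight::Resources::IiifHarvester', NULL),
        (3, 'Spotlight::Resources::Upload', 2);
    """)
    return conn


if __name__ == "__main__":
    print(uploads_without_same_id_image(build_demo_db()))
```

Against a real SQLite-backed installation, the same query text could be run directly in the sqlite3 shell; only Upload rows are checked because IIIF imports were not affected in my testing.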

mephillips-durham commented 7 months ago

This looks like it may be the same problem reportedly fixed in #2863.