Closed jcoyne closed 1 year ago
@jacobthill are all of these assets required? If not, can you tell us which we need?
I'm not sure what all is there. I assume we only need the latest version of each collection. @aaron-collier you can probably answer this question better than I can.
@jacobthill what do you mean by "the latest version of each collection"? These are jpeg image files.
I assume these s3 assets are the transformed harvested csv files and transformed json files (the intermediate representation). I don't know if we save old versions. If so, we don't need to migrate them, we would only need the latest one. I don't work with these files directly so I'm not sure what's there. I think @aaron-collier will be able to clarify.
blocked by #1645
To understand which images are in the S3 uploads directory and where the application may use them, I've been reviewing the database tables, the export JSON from the site itself (i.e. what you get from selecting export data through the admin interface), and the S3 folders. Here's a summary of what I've found:
a) The database table spotlight_featured_images stores the id for uploaded images for pages, exhibits, and browse categories or searches. These do not appear to be the same as images for specific items. (I'm not sure where the image files for individual items go.) The id is used as a reference in spotlight_pages (as thumbnail_id), spotlight_searches (as masthead_id or thumbnail_id), and spotlight_exhibits (as masthead_id or thumbnail_id). (Reference: https://github.com/projectblacklight/spotlight/blob/main/db/migrate/20150304111111_add_featured_image_to_spotlight_classes.rb)
b) The ids from the database table spotlight_featured_images are the set used by the S3 upload folder numbers. (The pattern is defined here: https://github.com/projectblacklight/spotlight/blob/main/app/uploaders/spotlight/featured_image_uploader.rb#L14 ) Not every id in the featured image table has an S3 folder. (For example, the table has ids 1, 3, 4, and 5, but there are no upload folders with these numbers. These ids all correspond to the same image name, "daniel-h-tong-202079.jpg".)
c) There are 79 items in the spotlight_searches database table. For comparison, the site export JSON has 78 searches listed.
d) The spotlight_exhibits table has only one row, and spotlight_pages has no values in the thumbnail_id column.
e) There are 1052 rows (i.e. featured images) in the spotlight_featured_images table. Not every featured image id has an image file name specified in the database (57 do not). When looking at the specific images and trying to compare with the uploads folder, I saw some duplicate image names.
Analyzing for unique image names (and not counting the ids without any image names associated) leaves us with 93 image names. Many of these images have multiple ids associated. For example, "yale_babylon.jpeg" has 5 ids: 691, 764, 837, 912, and 991. Looking at S3, all five appear to be the same image (based on my own assessment). These uploads correspond to last modified dates of 11/29/21, 1/3/22, 2/15/22, 3/3/22, and 1/12/23. Only one of these ids, 991 which is the last modified one in this set, is present in the searches table and corresponds to the thumbnail_id for searches (i.e. browse category) for "Yale Peabody Museum: Babylon Collection".
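The duplicate analysis above can be sketched in a few lines of Python. This is only an illustration, not the query that was actually run: the rows are hand-typed stand-ins for spotlight_featured_images, the row with id 2 and no file name is hypothetical, and `referenced_ids` is an assumed set of thumbnail_id/masthead_id values that would in practice be collected from spotlight_searches, spotlight_exhibits, and spotlight_pages.

```python
from collections import defaultdict

# Hand-typed sample rows: (featured image id, image file name).
# The ids for the two named images come from the discussion above;
# the row with id 2 and no file name is hypothetical.
featured_images = [
    (1, "daniel-h-tong-202079.jpg"),
    (2, None),
    (3, "daniel-h-tong-202079.jpg"),
    (4, "daniel-h-tong-202079.jpg"),
    (5, "daniel-h-tong-202079.jpg"),
    (691, "yale_babylon.jpeg"),
    (764, "yale_babylon.jpeg"),
    (837, "yale_babylon.jpeg"),
    (912, "yale_babylon.jpeg"),
    (991, "yale_babylon.jpeg"),
]

# Assumed: ids that some search, exhibit, or page still points at.
referenced_ids = {991}


def group_ids_by_name(rows):
    """Map each image file name to the sorted ids that share it, skipping rows without a name."""
    groups = defaultdict(list)
    for image_id, name in rows:
        if name:
            groups[name].append(image_id)
    return {name: sorted(ids) for name, ids in groups.items()}


def ids_in_use(groups, referenced):
    """For each image name, keep only the ids that are still referenced."""
    return {name: [i for i in ids if i in referenced] for name, ids in groups.items()}


groups = group_ids_by_name(featured_images)
in_use = ids_in_use(groups, referenced_ids)
for name, ids in groups.items():
    print(name, "ids:", ids, "in use:", in_use[name])
```

With real data, any image name whose "in use" list is empty would be a candidate to skip during migration.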
I could copy/paste the full list of files with matching ids here but, based on the above, it appears we do have duplicate images within the S3 file system. I have not downloaded the full list of S3 bucket names but that may be something to do to confirm these findings.
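If the full S3 listing is downloaded, the comparison between database ids and upload folders could look roughly like the sketch below. The key layout (`uploads/spotlight/featured_image/image/<id>/<file name>`) is an assumption based on the paths seen in this thread, and the sample keys and ids are hand-typed for illustration rather than taken from the bucket.

```python
import re

# Assumed key layout: .../featured_image/image/<id>/<file name>
KEY_PATTERN = re.compile(r"uploads/spotlight/featured_image/image/(\d+)/([^/]+)$")


def folder_ids(keys):
    """Extract the numeric folder ids from a list of object keys."""
    ids = set()
    for key in keys:
        match = KEY_PATTERN.search(key)
        if match:
            ids.add(int(match.group(1)))
    return ids


def compare(db_ids, keys):
    """Report ids that exist only in the database or only in the listing."""
    in_db = set(db_ids)
    in_listing = folder_ids(keys)
    return {
        "missing_from_listing": sorted(in_db - in_listing),
        "not_in_database": sorted(in_listing - in_db),
    }


# Hand-typed sample data, for illustration only.
sample_keys = [
    "uploads/spotlight/featured_image/image/912/yale_babylon.jpeg",
    "uploads/spotlight/featured_image/image/991/yale_babylon.jpeg",
]
sample_db_ids = [1, 3, 4, 5, 912, 991]

print(compare(sample_db_ids, sample_keys))
```

The key list could come from any listing tool exported to a text file; the sketch deliberately avoids depending on a particular S3 client.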
Regarding files that may need to change with respect to moving away from S3 buckets, in addition to the carrierwave.rb file, DLME has also overridden the riiif.rb file with conditionals checking for Settings.s3.upload_bucket (original Spotlight file: https://github.com/projectblacklight/spotlight/blob/main/lib/generators/spotlight/templates/config/initializers/riiif.rb). Commenting out the s3.upload_bucket line in settings.yml throws an error. I'm not sure if that's what we want to do or not (and/or whether we want to remove any S3 handling altogether).
@hudajkhan yes, the intention is to remove any S3 handling.
we can close this ticket once the actual (unique) image files have been moved to disks on the on-premise VMs. we might need a small script to assign them the right database IDs so that the app picks them up and associates them with pages/categories.
Reviewing the export and import functionality, it appears that when you export the exhibit, it generates a serialization of the image itself. When you import, it deserializes (See https://github.com/projectblacklight/spotlight/blob/main/app/services/spotlight/exhibit_import_export_service.rb#L79 and https://github.com/projectblacklight/spotlight/blob/main/app/services/spotlight/exhibit_import_export_service.rb#L210) and generates the image files and places them in the uploads directory. If the exhibit is being repopulated using the import function, we won't have to move the assets at all since they will be regenerated for us (and if we do move them, we'll have multiple copies of the same image in different folders). I have tried this out on my local machine by first deleting all the image files in the public/uploads directory and then importing the exhibit data. Once I do that, the image files are generated in the public/uploads directory once more.
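To illustrate why an import can regenerate the files, here is a minimal round-trip sketch. It is not Spotlight's implementation: it assumes the export embeds the image bytes in the JSON as an encoded string (base64 is used here purely as an example encoding), and the function names and record fields are hypothetical.

```python
import base64
import json
import tempfile
from pathlib import Path


def export_image(path):
    """Serialize an image file into a JSON-friendly record (hypothetical shape)."""
    return {
        "filename": path.name,
        "data": base64.b64encode(path.read_bytes()).decode("ascii"),
    }


def import_image(record, uploads_dir, image_id):
    """Recreate the file under <uploads_dir>/<image_id>/<filename> from the record."""
    target = Path(uploads_dir) / str(image_id) / record["filename"]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(base64.b64decode(record["data"]))
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "thumbnail.jpg"
    source.write_bytes(b"\xff\xd8 not a real jpeg \xff\xd9")  # placeholder bytes

    exported = json.dumps(export_image(source))  # what would travel in the export
    restored = import_image(json.loads(exported), Path(tmp) / "uploads", 1)

    print(restored.read_bytes() == source.read_bytes())
```

Because the bytes travel inside the export, the target system needs no access to the original storage, which matches the behavior observed locally.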
wow!! I didn't expect this functionality. I guess we may not need this ticket at all after we import the exhibit data. @hudajkhan do you want to try exporting the production exhibit and importing it in stage?
Yes I shall try that.
I did try that and see the image files recreated under the appropriate public/uploads/spotlight sub-folders, but the exhibit thumbnails are not being displayed. The log shows a Riiif::ConversionError: Unable to execute command "identify -format '%h %w %m %[channels]' /opt/app/dlme/dlme/releases/20230221232246/public/uploads/spotlight/featured_image/image/1156/photo-emirgan.jpg[0]". I had seen a similar error on my local machine, and @jcoyne had suggested I install imagemagick locally, which solved the problem.
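A quick way to confirm whether the `identify` command from ImageMagick is available on a given machine is sketched below. The helper name is hypothetical, and the check only shows that the binary is on the PATH of the user running the script, which may differ from the environment the application runs in.

```python
import shutil
import subprocess


def imagemagick_status(command="identify"):
    """Return the first line of `<command> -version`, or None if the command is not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    # Fall back to the resolved path if the command printed nothing.
    return lines[0] if lines else path


status = imagemagick_status()
if status is None:
    print("identify not found; ImageMagick is probably not installed")
else:
    print(status)
```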
Next step: Based on discussion with @thatbudakguy and @corylown, I am setting up a PR that adds imagemagick to the puppet config for staging. Link to be included below.
Here is the puppet pull request: https://github.com/sul-dlss/puppet/pull/9186 .
With the changes merged, the thumbnails are appearing on staging.
closing this, since we successfully added images and assume it'll work the same in prod.
Reopening until this is done on prod.
Puppet PR for adding imagemagick to dlme production: https://github.com/sul-dlss/puppet/pull/9194
The PR is merged, and I imported into production the same JSON that I had exported from production earlier for the stage import. First, I created the exhibit with the url "library", then imported the JSON. The exhibit thumbnails and masthead are now visible.
excellent. gonna close this now
Change the spotlight storage engine to something else: https://github.com/sul-dlss/dlme/blob/78bb403bc727ec0cff89ac95cc8b6bdd0bdf24cf/config/initializers/carrierwave.rb#L8