jhu-idc / iDC-general

Contains non-code-base specific tickets relating to the Islandora8 for Digital Collection project

Re-ingesting files seems to cause access errors #475

Open bseeger opened 2 years ago

bseeger commented 2 years ago

This started with errors in the Drupal log from Drupal generating its thumbnails, but this seems to happen in the cloud as well when the external services are running derivatives.

A big distinction to catch here is that Drupal makes thumbnails for its admin-facing pages (specifically the Media page). Those thumbnails are different from the thumbnails we make in the houdini container, which are user-facing.

If the Drupal admin-facing ones fail, it's not a big deal, but it looks like houdini is affected by this issue as well. Just something to keep in mind as you read through below.


Note: I'm seeing this message on the cloud server, but not in my dev environment, so I wonder if it's an AWS permission error.

On the test cloud server: upon going to the Media page, I started seeing these errors in the log:

Unable to generate the derived image located at private://styles/thumbnail/private/2022-01/3061-Service File.jpg.

[Screenshots: Drupal log error entries, 2022-01-13]

These are Drupal thumbnails, which are distinct from our derivative thumbnails. Drupal wants to create a thumbnail simply to display the image on the Media List page (for users logged in with rights to use the admin interface). An example:

[Screenshot: example thumbnail on the Media List page]

These are created in the Drupal container by imagemagick (it does not use the deriv containers). I wonder if there's an access error here where Drupal can't retrieve the file from AWS? Or maybe the file can't be saved back to AWS once created? Not exactly sure what's going on here.

In my local setup, Drupal creates a styles folder in minio for Drupal's thumbnails: [Screenshot: styles folder in minio]

Perhaps that's not successfully happening on AWS?

jhujasonw commented 2 years ago

Is there a way to reliably re-create this issue?

bseeger commented 2 years ago

I'm not sure. But I do see the error after I visit the https://test.digital.library.jhu.edu/admin/content/media page - so maybe just visiting it causes the error to be thrown.

bseeger commented 2 years ago

After looking at this a little more, I think what is happening is related to allowing multiple ingests of the same file items, and it may be related to allowing one to rename files during ingest.

We allow admins to rename files during ingest to fix filename structures that might not work for the system. That may or may not be the issue; more likely, the issue is simply that we allow re-ingest of files while the old files stick around.

So if we have the following setup in the ingest:

| name | upload name | new filename |
| --- | --- | --- |
| The Name | file_name[0].jpg | filename.jpg |

If an admin runs that type of ingest twice, the new file will have _X tacked onto the name before the extension (where X is a number). The ingest algorithm adds the _X to disambiguate the files: it keeps the old one around and creates a new one. So, once this is uploaded twice, the second file (the one actually used) will be named filename_0.jpg. The File entity will still be named correctly, with no change. There will now be two File entities with the same name but pointing to two different files.

| File | url | media_of |
| --- | --- | --- |
| The Name | filename_0.jpg | Media One |
| The Name | filename.jpg | |

(The first file ingested will be disconnected from any media, so it's essentially unused.)

The one with Media One set in its media_of field is the real one and is considered referenced by an object. But the other one still exists, and fetching it results in the 403 error we are seeing, which appears to be what Drupal does when creating its own thumbnail here.

Perhaps this error is innocuous, since the file is really unused and doesn't need a thumbnail. However, we should check that the proper File (filename_0.jpg) does get a Drupal thumbnail.
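The _X disambiguation described above can be sketched roughly as follows. This is a simplified illustration, not Drupal's actual implementation; the helper name and logic are assumed for illustration only.

```shell
# Sketch of the disambiguation behavior described above (NOT Drupal's
# actual code): if the target name already exists, append _0, _1, ...
# before the extension until a free name is found.
unique_name() {
    dir=$1
    name=$2
    base=${name%.*}
    ext=${name##*.}
    candidate=$name
    i=0
    while [ -e "$dir/$candidate" ]; do
        candidate="${base}_${i}.${ext}"
        i=$((i + 1))
    done
    printf '%s\n' "$candidate"
}
```

With filename.jpg already stored, a second ingest of the same item would land at filename_0.jpg, which matches the behavior observed here.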

bseeger commented 2 years ago

Provided the Islandora file derivatives function correctly and grab the correct file (which they seem to), this issue is probably minor in the grand scheme of things, and these errors are just noise in the logs (yes, Drupal fails to make its thumbnails, but those are admin-facing). Provided that's true, the only effect is that logged-in admins will not see a thumbnail on the file list page (/admin/content/media), which is no biggie in terms of the system functioning.

bseeger commented 2 years ago

Actually, I think I was wrong about the scope here. After watching the cloud for a while, it appears that the wrong URL is handed to the derivative services as well (or they are somehow fetching the wrong files). This will be an issue for re-ingest in the cloud services. :(

jhu-alistair commented 2 years ago

High priority because it blocks our ability to re-ingest when there are errors in an ingest job.

jhu-alistair commented 2 years ago

Possibly, Bethany thought it was a problem in S3.

DonRichards commented 2 years ago

I agree this would be classified as a high priority and is likely either an S3 or a production-specific environment config setting.

jhujasonw commented 2 years ago

Please re-read her notes on this; she later indicates that this is a file naming issue and NOT an S3 or production-specific thing. This appears to be something happening inside of Drupal.

DonRichards commented 2 years ago

@jhujasonw I think her initial comments were on the right page. I see where she changed her thoughts on it, but it appears the URL generated for an ingest works correctly for the first ingest and not for the second.

One situation that "could" be the issue is an S3 permission configuration set to write-once (s3:PutObject events) on a bucket. I'm speculating; I have no knowledge of the bucket configurations (and I'm not an expert with S3 ACLs). It just seems like a logical possibility to explain the odd behavior of "works the first time but not the second".

A simple way of checking this would be to run the exact same migration locally, or to trigger a regenerate-derivative event in production and see if it fails. If the migration fails locally in the same manner, then it's safe to say the S3 permissions are not the issue. But if it succeeds locally and a regenerate-derivative event in production ends in a failure, that would be worth investigating. This is what I thought Bethany had alluded to in her last comment.

There are also other situations that could cause odd behavior in migrations (as she indicated above). If the migration isn't triggering an "update" and is instead ingesting new media files, it will see all of the items as new, which could cause this issue. The migration could also address the filename collision itself.

On the other hand, a production-specific workaround could be to disable derivative generation while migrating, then trigger a create/recreate of all thumbnails for a given list of media files, if the migration isn't causing other issues.

mjanowiecki commented 2 years ago

Some random info that may or may not be relevant (librarian, not tech person here so please ignore me if this is all nonsense).

DonRichards commented 2 years ago

This may seem off-topic, but we could avoid the naming collision issue by using unique values as the media filenames. In theory, the original file's hash is not affected by renaming it. Running a script like this locally could copy the files to a new directory, rename them to their hash values, log the original and new names, and report any errors.

destination='/processed_images'
: > "$destination/log"    # truncate the log
for file in *.{jpg,jpeg,png,tif,tiff,jp2}
do
    [ -e "$file" ] || continue              # skip patterns that matched nothing
    sum=$(sha256sum "$file")
    sum="${sum%% *}"                        # keep only the hash, drop the filename
    cp "$file" "$destination/$sum"
    echo "$file $destination/$sum" >> "$destination/log"
    # verify the copy by comparing the content hashes of source and copy
    [ "$(sha256sum < "$file")" = "$(sha256sum < "$destination/$sum")" ] || echo "Problem with $destination/$sum"
done

This should safeguard against the filename collision issue and make identifying duplicates simple. This could always be offloaded to a module instead, something like filehash.
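The premise above, that renaming a file does not change its content hash, can be checked directly. A trivial demonstration using a throwaway temporary file:

```shell
# Demonstrate that a file's content hash is unchanged when the file is
# copied under a different name: only the bytes matter, not the filename.
tmp=$(mktemp -d)
printf 'sample image bytes\n' > "$tmp/original.jpg"
cp "$tmp/original.jpg" "$tmp/renamed.jpg"
h1=$(sha256sum < "$tmp/original.jpg")
h2=$(sha256sum < "$tmp/renamed.jpg")
[ "$h1" = "$h2" ] && echo "hashes match"   # prints "hashes match"
```

This is why a hash-based naming scheme sidesteps the _X collision handling entirely: re-ingesting the same bytes always maps to the same name.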

mjanowiecki commented 2 years ago

Unfortunately, the filenames are important for librarians to manage files and keep them associated with the right items, so we can't really change them without stakeholder approval.

DonRichards commented 2 years ago

@mjanowiecki This is the case once in Islandora? Or are we talking about offline (preprocessing/reprocessing)?

mjanowiecki commented 2 years ago

@DonRichards I think so. It does help track/verify that the right files are with the right item in an easy way, and it's also an access/usability consideration for the end-users. As an end-user/researcher who might be downloading many different files, it's difficult to organize them when the filename has no recognizable association with the metadata (if that makes sense?).

jhu-alistair commented 2 years ago

@DonRichards and @mjanowiecki - please move work and discussion over to Jira. This issue is now at https://jhulibraries.atlassian.net/browse/LAGS-172