samvera / hyku

Hyku: A multi-tenant Hyrax application built on the latest and greatest Samvera community components. Brought to you by the Hydra-in-a-Box project partners and IMLS; maintained by the Hyku Interest Group.
https://samvera.atlassian.net/wiki/spaces/hyku/overview
Other
94 stars 47 forks source link

Images with metadata conflict fail to load in UV #1394

Open hannahfrost opened 7 years ago

hannahfrost commented 7 years ago

From St. Lawrence pilot: We are experiencing an issue whereby newly uploaded images are listing as “Not Found” via the Universal Viewer. This is for images added from accounts with the admin role and also via a basic user account.

Steps to Reproduce:

Screenshots attached.

Example record image-iiif-unauthorized image-not-found-iiif-viewer

:

Additional Details: Images were uploaded via a MacPro/MacOS Sierra version 10.12.6, Chrome 60.x and Firefox 52.x

atz commented 7 years ago

Image is definitely there and can be viewed and downloaded via the "Actions" dropdown or on the fileset page: https://stlawu.pilot.hykudirect.org/concern/parent/0a13e77d-b826-46b7-b5c1-5a2ee20e6cdf/file_sets/96c4f61a-c4c9-4d98-8f00-34b3974d1bc4

So this is just a problem w/ UV integration.

atz commented 7 years ago

The JS console has two errors:

Uncaught Error: See almond README: incorrect module build, no module name
    at define (application-a558d44….js:24)
    at application-a558d44….js:29
    at application-a558d44….js:29

AND

GET https://stlawu.pilot.hykudirect.org/images/96c4f61a-c4c9-4d98-8f00-34b3974d1bc4%2Ffiles%2F9c2ed1ba-3237-4213-873c-ea9727aedaf7/info.json  404 Not Found

That %2Ffiles%2F looks like a possible URI-encoding failure (a %2Fs would otherwise be a /). The response for the request as sent is:

{
error: "no info"
}

But changing those to slashes only gets a slightly more detailed failure:

{
status: 404,
error: "Not Found"
}

So it isn't that basic. Unsure whether the almond error is relevant.

atz commented 7 years ago

Our images route is mapped to Riiif::Engine. From rails routes:

riiif          /images                                                         Riiif::Engine
...
 image GET  /:id/:region/:size/:rotation/:quality.:format riiif/images#show {:format=>"jpg", :rotation=>/[\w.]+/, :region=>"full", :quality=>"default", :model=>"riiif/image", :size=>/(!|pct:)?[\w.,]+/}
  info GET  /:id/info.json(.:format)                      riiif/images#info {:format=>"json", :model=>"riiif/image"}
  base GET  /:id(.:format)                                riiif/images#redirect

As sent, this is an info request with an id of:

96c4f61a-c4c9-4d98-8f00-34b3974d1bc4%2Ffiles%2F9c2ed1ba-3237-4213-873c-ea9727aedaf7

So somebody with UV knowledge needs to determine: does that make sense?

The thumnail image that does display is retrieved via:

/images/96c4f61a-c4c9-4d98-8f00-34b3974d1bc4%2Ffiles%2F9c2ed1ba-3237-4213-873c-ea9727aedaf7/full/!150,300/0/default.jpg

I.E., it appears to use the same ID, also passed to the Riiif::Engine, so my assumption is that the ID does make sense.

atz commented 7 years ago

Riiif controller servicing the request is here: https://github.com/curationexperts/riiif/blob/v1.4.4/app/controllers/riiif/images_controller.rb#L33-L45

The (summarized) logic that is failing is:

model.new(params[:id]).info.valid?

And .valid? is just:

    # Image information is only valid if height and width are present.
    # If an image info service doesn't have the value yet (not characterized perhaps?)
    # then we wouldn't want to cache this value.
    def valid?
      width.present? && height.present?
    end

Obviously, the comment there suggests a problem: characterization. Recommend somebody with access to that system run a characterization job on the file/fileset and see if that fixes it.

bbranan commented 7 years ago

A bit of testing that seems to confirm the conclusion @atz came to: Uploading the same file again, in a new tenant with a simple name and title yielded the same "Not Found" error: https://testtenant.pilot.hykudirect.org/concern/images/20a3bb97-eb0e-449d-9ac8-3a2e28b703f7. Doing the same thing again with a different image file worked fine: https://testtenant.pilot.hykudirect.org/concern/images/454a745c-113d-4db3-8d62-e268f8bf7af9

I'd be happy to run a characterization job on the image file. Pointers for how to do that would be helpful.

mjgiarlo commented 7 years ago

Hey, @bbranan. Here's how to run the characterization job in the Rails console:

file_set = FileSet.find('d2360902-2ad8-4d10-a8a5-e947ec953577')
# Attempted to init base path `66572f04-a2ad-4b64-852b-08df2dfcdb61`, but it already exists
# => #<FileSet id: "d2360902-2ad8-4d10-a8a5-e947ec953577", head: [], tail: [], depositor: "mjgiarlo@stanford.edu", title: ["2012-01-21_17-19-39_815.jpg"], date_uploaded: "2017-08-04 16:07:06", date_modified: "2017-08-04 16:07:06", label: "2012-01-21_17-19-39_815.jpg", relative_path: nil, import_url: nil, resource_type: [], creator: ["mjgiarlo@stanford.edu"], contributor: [], description: [], keyword: [], license: [], rights_statement: [], publisher: [], date_created: [], subject: [], language: [], identifier: [], based_near: [], related_url: [], bibliographic_citation: [], source: [], access_control_id: "de0bf3ad-38ed-4a21-a0b4-4f489ce4ecf7", embargo_id: nil, lease_id: nil>
io = JobIoWrapper.find_by(file_set_id: file_set.id)
# => #<JobIoWrapper id: 10, user_id: 1, uploaded_file_id: 82, file_set_id: "d2360902-2ad8-4d10-a8a5-e947ec953577", mime_type: nil, original_name: nil, path: "uploads/66572f04-a2ad-4b64-852b-08df2dfcdb61hyrax/...", relation: "original_file", created_at: "2017-08-04 16:07:07", updated_at: "2017-08-04 16:07:07">
file_id = file_set.original_file.id
# => "d2360902-2ad8-4d10-a8a5-e947ec953577/files/d77a38ec-2a23-4ed9-b84c-ce3bd868231e"
CharacterizeJob.perform_later(file_set, file_id, io.uploaded_file.uploader.path)
# [ActiveJob] Enqueued CharacterizeJob (Job ID: 22840e35-dc80-46d2-b968-42fffab12505) to BetterActiveElasticJob(default) with arguments: #<GlobalID:0x0055caaaf1cf00 @uri=#<URI::GID gid://hyku/FileSet/d2360902-2ad8-4d10-a8a5-e947ec953577>>, "d2360902-2ad8-4d10-a8a5-e947ec953577/files/d77a38ec-2a23-4ed9-b84c-ce3bd868231e", "uploads/66572f04-a2ad-4b64-852b-08df2dfcdb61hyrax/uploaded_file/file/82/2012-01-21_17-19-39_815.jpg"
# => #<CharacterizeJob:0x0055caaaf28238 @arguments=[#<FileSet id: "d2360902-2ad8-4d10-a8a5-e947ec953577", head: [], tail: [], depositor: "mjgiarlo@stanford.edu", title: ["2012-01-21_17-19-39_815.jpg"], date_uploaded: "2017-08-04 16:07:06", date_modified: "2017-08-04 16:07:06", label: "2012-01-21_17-19-39_815.jpg", relative_path: nil, import_url: nil, resource_type: [], creator: ["mjgiarlo@stanford.edu"], contributor: [], description: [], keyword: [], license: [], rights_statement: [], publisher: [], date_created: [], subject: [], language: [], identifier: [], based_near: [], related_url: [], bibliographic_citation: [], source: [], access_control_id: "de0bf3ad-38ed-4a21-a0b4-4f489ce4ecf7", embargo_id: nil, lease_id: nil>, "d2360902-2ad8-4d10-a8a5-e947ec953577/files/d77a38ec-2a23-4ed9-b84c-ce3bd868231e", "uploads/66572f04-a2ad-4b64-852b-08df2dfcdb61hyrax/uploaded_file/file/82/2012-01-21_17-19-39_815.jpg"], @job_id="22840e35-dc80-46d2-b968-42fffab12505", @queue_name="default", @priority=nil, @executions=0>
mjgiarlo commented 7 years ago

@atz Can you check my math above? Specifically: we should be able to rely on io.uploaded_file.uploader.path working in our AWS deployment, right? FWIW, I tested this on a random FileSet in one of my test tenants and io.uploaded_file.uploader.path and io.path returned the same path ("uploads/66572f04-a2ad-4b64-852b-08df2dfcdb61hyrax/uploaded_file/file/82/2012-01-21_17-19-39_815.jpg"), so maybe in this deployment these can be used interchangeably?

(If you find you need to run this regularly, @bbranan, we may want to create a new ticket to make this into a rake task. File that away for later, I guess?)

bbranan commented 7 years ago

Given that the same file produced the same resulting error in two different tenants (and I tried more than once in my test tenant), it doesn't appear that re-running characterization will do the trick. @mjgiarlo ran FITS locally on the file, resulting in:

 <identification status="CONFLICT">
    <identity format="JPEG EXIF" mimetype="image/jpeg" toolname="FITS" toolversion="1.1.1">
      <tool toolname="Droid" toolversion="6.1.5" />
      <tool toolname="Exiftool" toolversion="10.00" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.6GA" />
      <version toolname="Droid" toolversion="6.1.5">2.2.1</version>
      <externalIdentifier toolname="Droid" toolversion="6.1.5" type="puid">fmt/645</externalIdentifier>
    </identity>
    <identity format="JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=14, height=11200, bps=0, compression=none, PhotometricI
ntepretation=RGB, orientation=upper-left, width=10400], baseline, precision 8, 2000x2154, frames 3" mimetype="image/jpeg" toolname="FITS" toolversion=
"1.1.1">
      <tool toolname="file utility" toolversion="5.25" />
    </identity>
  </identification>

Note that both width and height are included here, but also status="CONFLICT" and two differing identity tags.

jcoyne commented 7 years ago

I reported this in the fits repo and I believe it's fixed in a more recent (possibly master) version

mjgiarlo commented 7 years ago

I used FITS 1.1.1 (released on May 30th of this year) to generate the output above, FWIW.

atz commented 7 years ago

@mjgiarlo: you can only rely on uploader.path there when the file happens to have been on the same system in carrierwave cache (i.e. not on a worker system). Otherwise you get nil for an unfetched remote object, as we experienced. If you want to guarantee access to uncached remote content, you need to ask uploader.sanitized_file which pulls the content into a StringIO.

> Hyrax::UploadedFile.first.uploader.sanitized_file
 => #<CarrierWave::SanitizedFile:0x007f8669a83238 @file=#<StringIO:0x007f8669a8aa38>, @original_filename="devo_freedom_of_choice_album_p.jpg", @content_type="image/jpeg"> 

You can see where the uploader's implementation is super-halfass about path, not even confident enough to fully delegate it to the file object.

TL;DR: uploader.path is allowed to fail, mainly because the (configurable) file object class might not implement it. Good news: the 3rd argument to CharacterizeJob.perform_later is optional, so if the file isn't local (anymore) you can just leave it off. The job will pull down the file itself.

atz commented 7 years ago

Image has bad metadata.
screen shot 2017-08-09 at 6 01 51 pm screen shot 2017-08-09 at 6 01 59 pm

The part where Photoshop CS5 says the image is 72x72 should be removed and the image re-uploaded. I suspect this piece of metadata comes from a PSD file and not the JPG, specifically the PSD's thumbnail.

We can talk about how the system could give better feedback here, but this is basically a WONTFIX. We are not implementing heuristics for effectively resolving conflicting image metadata while preserving it as "everybody else downstream's problem". And we aren't going to use heuristics to subtly alter uploaded files as they are preserved. If there's no further development direction added, I will close this ticket in a couple days.

bbranan commented 7 years ago

I agree that the uploaded image file should not be changed as it is stored. However, the system should be able to recognize a failure in derivative generation and not attempt to display the universal viewer for that image. Of course, also providing feedback to the user to let them know about the failure (and what could be done to resolve it) would be ideal.

There are a variety of reasons why it may not be practical or desirable to edit an original file, even to fix incorrect metadata. It should still be possible to store those files in the repository without them appearing to be broken in the UI. In this case, thumbnail generation seems to have worked fine, which is still valuable.

atz commented 7 years ago

They are broken in the UI because they are broken in fact. UV requires metadata we cannot accurately provide because the user did not provide it. Might as well complain that "my picture of a giraffe looks like fish."

I can imagine being more explicit about why failure happens, but I can't really imagine papering over the brokenness to make it invisible or semi-functional:

bbranan commented 7 years ago

@atz if I'm hearing you correctly, you seem to be suggesting that in this scenario we should either (1) reject the image file outright or (2) store the file and allow the UV viewer to display an error.

I'm suggesting that neither of these approaches are preferable. While the image in question may have internal metadata which is not entirely consistent, the primary purpose of the file is to display an image, not to be a metadata structure. Every image viewing application I have tried is able to display the image. The user noted that he has stored and used this file successfully on ContentDM, DSpace, Shared Shelf, and Drupal. Does this mean the metadata shouldn't be fixed? No. But from the perspective of most users (right or wrong) if Hyku can't handle this file, then Hyku is broken, not the file.

I would suggest that failing gracefully (by not displaying the UV viewer if we know it will fail to display the file) is a far better approach than defaulting to a "fix your data" response. The person depositing the data is often not in a position to do anything with the data other than deposit it. Telling them there is a problem with a file is great, but doing everything we can to handle it as-is is better, even if that means turning off a feature or two.

mjgiarlo commented 7 years ago

While the embedded metadata on these files is odd, I still do not understand why the FITS output looks OK and yet we're not seeing the characterization terms show up on these files as expected. I wonder if we can find someone in the community who's dealt with files like these before? That @jpstroop fellow knows a lot about image file formats... :)

atz commented 7 years ago

@bbranan My bullet points were problematizing statements, not recommendations, so please don't read them that way. UV requires the metadata be correct. It is incorrect. [UPDATE: the rest of this comment is basically mistaken.]

I agree that some type of detection or fallback might be an improvement, but this isn't a bug in UV and is not a bug in Hyku.

That would be a novel feature (not a bugfix) requiring a different flow. Right now the modeled object (that detects invalidity) is entirely in UV client-side JS. We would need to build a parallel capability on the server-side. For a single image page, maybe that wouldn't seem like a big deal, but for an entire IIIF manifest worth of (potentially thousands of) images, it very much would be a big deal and would undercut many of the advantages of UV.

@mjgiarlo See the screenshots and my commentary above. The metadata does NOT look OK. It is internally in conflict: 2000x2154 vs 72x72. Since we are primarily concerned with archival preservation of valid data, accepting or papering over points of invalidity is hardly a feature. Making it easier to get malformed data in the archive is counter-productive.

mjgiarlo commented 7 years ago

@atz Sorry, what I meant was that the <metadata> section of the FITS (1.1.1) output looks fine to me:

<metadata>
    <image>
      <compressionScheme toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">Uncompressed</compressionScheme>
      <imageWidth toolname="Exiftool" toolversion="10.00">2000</imageWidth>
      <imageHeight toolname="Exiftool" toolversion="10.00">2154</imageHeight>
      <colorSpace toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">RGB</colorSpace>
      <iccProfileName toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">Adobe RGB (1998)</iccProfileName>
      <iccProfileVersion toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">2.1.0</iccProfileVersion>
      <YCbCrSubSampling toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">1 1</YCbCrSubSampling>
      <orientation toolname="Exiftool" toolversion="10.00">normal*</orientation>
      <samplingFrequencyUnit toolname="NLNZ Metadata Extractor" toolversion="3.6GA" status="SINGLE_RESULT">in.</samplingFrequencyUnit>
      <xSamplingFrequency toolname="Exiftool" toolversion="10.00">72</xSamplingFrequency>
      <ySamplingFrequency toolname="Exiftool" toolversion="10.00">72</ySamplingFrequency>
      <bitsPerSample toolname="Exiftool" toolversion="10.00">8 8 8</bitsPerSample>
      <samplesPerPixel toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">3</samplesPerPixel>
      <scanningSoftwareName toolname="Exiftool" toolversion="10.00" status="SINGLE_RESULT">Adobe Photoshop CS5 Windows</scanningSoftwareName>
      <exifVersion toolname="Exiftool" toolversion="10.00">0221</exifVersion>
      <lightSource toolname="NLNZ Metadata Extractor" toolversion="3.6GA" status="SINGLE_RESULT">unknown</lightSource>
    </image>
  </metadata>
atz commented 7 years ago

Ah, I see. Yeah, you are right. xSamplingFrequency (DPI) is not the same as X Pixel Dimension. The dimensions are not in conflict, so my premise was incorrect.

The XML components of the <identification status="CONFLICT"> are:

   <identity format="JPEG EXIF" mimetype="image/jpeg" toolname="FITS" toolversion="1.1.1">
     <tool toolname="Droid" toolversion="6.1.5" />
     <tool toolname="Exiftool" toolversion="10.00" />
     <tool toolname="NLNZ Metadata Extractor" toolversion="3.6GA" />
     <version toolname="Droid" toolversion="6.1.5">2.2.1</version>
     <externalIdentifier toolname="Droid" toolversion="6.1.5" type="puid">fmt/645</externalIdentifier>
   </identity>

vs:

   <identity format="JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=14, height=11200, bps=0, compression=none, PhotometricI
ntepretation=RGB, orientation=upper-left, width=10400], baseline, precision 8, 2000x2154, frames 3" mimetype="image/jpeg" toolname="FITS" toolversion=
"1.1.1">
     <tool toolname="file utility" toolversion="5.25" />
   </identity>

I don't know why the <tool toolname="file utility" toolversion="5.25" /> element couldn't be included in the larger block, but the format is the point of divergence. I don't know FITS well enough to know the significance.

mjgiarlo commented 7 years ago

FWIW, even some of the fixture objects in hydra-works, where the characterization service lives, have this same (or similar) conflict in them. See a sampling of the fits_*.xml docs here: https://github.com/samvera/hydra-works/tree/master/spec/fixtures

atz commented 7 years ago

So the file utility output is entirely based on the local (unix) file executable and the "magic files" defined at the system level.

Hyrax has really limited coverage for how broad the matrix of interactions underneath a FITS dependency might be:

Locally, I have:

On the same file, I get:

  <identification>
    <identity format="Exchangeable Image File Format" mimetype="image/jpeg" toolname="FITS" toolversion="0.8">
      <tool toolname="file utility" toolversion="5.04" />
      <tool toolname="Exiftool" toolversion="9.13" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
    </identity>
  </identification>

No conflicts in that part. I do get a conflict in in <fileinfo> though, presumably unrelated:

    <created toolname="Exiftool" toolversion="9.13" status="CONFLICT">2011:12:13 12:46:36-05:00</created>
    <created toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="CONFLICT">2012:04:12 11:27:33</created>

@mjgiarlo Note though, that my Exchangeable Image File Format is another different format that doesn't match either of the two already in conflict.

On my local Hyrax, the same image appears to be valid:

> fs = RareBooks::Atlas.find('0k225b04p').file_sets.first
 => #<FileSet id: "q237hr920", head: [], tail: [], depositor: "archivist1@example.com", title: ["xlm_microscopy_schreiber_02-summer-flounder_sc.jpg"], date_uploaded: "2017-08-15 22:51:16", date_modified: "2017-08-15 22:51:16", label: "xlm_microscopy_schreiber_02-summer-flounder_sc.jpg", relative_path: nil, import_url: nil, resource_type: [], creator: ["archivist1@example.com"], contributor: [], description: [], keyword: [], license: [], rights_statement: [], publisher: [], date_created: [], subject: [], language: [], identifier: [], based_near: [], related_url: [], bibliographic_citation: [], source: [], access_control_id: "39835492-ec7f-4269-a1b8-e329ae6a8d6e", embargo_id: nil, lease_id: nil> 
> fs.height
 => ["2154"] 
> fs.width
 => ["2000"] 
> fs.valid?
 => true 
atz commented 7 years ago

In my local fedora itself: screen shot 2017-08-15 at 6 37 23 pm