Catch corrupt original files (graphics and sound) earlier and more cleanly

jrochkind commented 2 years ago

Sometimes our "Capture One" digitization pipeline produces TIFFs that are corrupt in some way.

They will make our ingest pipeline fail with an error such as:

on vips dzsave, magick2vips: libMagick error: Sanity check on directory count failed, zero tag directories not supported. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/604. (https://app.honeybadger.io/projects/58989/faults/84983034)
On vipsthumbnail:
- TIFFFetchDirectory: Sanity check on directory count failed, zero tag directories not supported TIFFReadDirectory: Failed to read directory at offset 41751608 magick2vips: libMagick error: magick2vips: unable to read file "/tmp/shrine20220414-5055-9zmz7n.tif"
- https://app.honeybadger.io/projects/58989/faults/84983035

We're not exactly sure what's up with these TIFFs. file and mediainfo both report a sample as image/tiff, without complaint.

However, imagemagick identify will say:

identify  ~/Desktop/soda_fountain_beverages_ncta7o9_156_4oaxl62.tiff
identify: Sanity check on directory count failed, zero tag directories not supported. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Failed to read directory at offset 41751608. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/596.`

And ffprobe (which it turns out, weirdly, you can use on a TIFF), will output to stderr:

[tiff @ 0x122f043d0] IFD offset is greater than image size
[tiff_pipe @ 0x122e05a00] Could not find codec parameters for stream 0 (Video: tiff, none): unspecified size
Consider increasing the value for the 'analyzeduration' (0) and 'probesize' (5000000) options

Every time this happens, we get an error on failed ingest/derivative creation, and need to debug it a bit to figure out why it failed -- oh, corrupt TIFF.

Should our ingest process actually catch this more explicitly and flag it as a corrupt TIFF?
- catch the specific error message from the failure for this specific case, and consider it a corrupt tiff?
- run identify or ffprobe on it in the early ingest pipeline, and abort ingest intentionally if that reveals it as this weirdness?
- In either case, we need a way to make it clear in the Admin UI what happened, and to alert/report so it can be fixed
Once it happens, our staff needs to go delete/re-ingest to correct it. Is this not too hard in our current staff UI, or do we need to provide UI to make it easier?
Should we report to the vendor, perhaps they can fix it so the corrupt TIFF don't happen, or have other suggestions?

eddierubeiz commented 2 years ago

Example of such an apparently corrupt tiff ( I compressed it before attaching it to this ticket because Github wouldn't allow me to attach a raw tiff file. Please decompress tiff the file before examining it.)

soda_fountain_beverages_ncta7o9_156_4oaxl62.tiff.gz

jrochkind commented 2 years ago

i'm gonna move this one back to "backlog", we have other stuff prioritized and don't really plan to do this soon

jrochkind commented 2 years ago

@eddierubeiz When you attach those two sample files (one from a couple months ago, one from last week), they show up as .tar.gz -- am I right that you are .tar.gz'ing them just to attach them, and the original file that was uploaded to try to ingest in our system is file I get if I un-tar-gz it, presumably a file whose suffix is .tiff?

eddierubeiz commented 2 years ago

Yes to all the above questions. I'm going to reattach the one from last week.

jrochkind commented 2 years ago

@eddierubeiz Unfortunately, when I try to open the file I downloaded above, from blasting_accessories_blas_ph2s05f_48_m1zy3an.tiff.gz, I get "The archive is empty". And indeed it's only 99 bytes big, so.

Let's see about the one from two months ago... That one does work! So I have at least one example, okeydoke.

Not sure what's going on with the difficulties attaching this latest one! Are you using an unusual method of making the .gz?

eddierubeiz commented 2 years ago

Let's try this again!

Clarification: the only reason the images above are compressed is because you can't attach a tiff to a github ticket. To view the file as I downloaded it from the digital collections, decompress it first. The file that failed derivative generation has a .tiff extension.

blasting_accessories_blas_ph2s05f_48_m1zy3an.tiff.gz

jrochkind commented 2 years ago

Finding a tool that can cheaply and accurately identify these "bad TIFFs" is harder than I anticipated... maybe we do just need to rescue the error on trying to create derivatives, but that's harder to fit into the architecture.

@apinkney0696 if you have a second, I'm curious what you know about the nature of these "bad" TIFFs created by "Capture One". Do you know anything about what causes them and what they are like?

eddierubeiz commented 2 years ago

We established that it's Capture One problem; Annabel has already contacted Capture One about it. The outcome was that Annabel can avoid the problem by exporting to a local drive rather than to a network drive.

jrochkind commented 2 years ago

Thanks @eddierubeiz . I'm curious if we know anything more than that about the nature of the corruption, what Capture One is doing. Because it might help me figure out how to identify the corrupt files. But I understand we may not.

jrochkind commented 2 years ago

I'm proably going to take a break from this, so documenting where I got:

How to identify invalid tiffs?
- I like using the tiffdump command line, and lean toward this
  - I like that this is a very fast/cheap operation, that doesn't actually do any characterization, it just spits out some TIFF technical details -- or in case of our broken tiffs, to stderr an error message No space for TIFF directory, along with a non-0 exit code.
  - While it doesn't produce machine-readable output (such as JSON), the output is pretty simple. And you can count on a non-0 exit code, then just look at stderr.
  - This is part of libtiff. Not currently installed on our heroku machine, but should be installable via Aptfile libtiff-tools.
  - On my Mac, it's already installed, and by brew -- not sure why, I think as part of brew libtiff package, which was already a dependency of vips etc. (brew has the tiffdump command line in libtiff package itself, while apt segregates it in a separate libtiff-tools)
- I was interested in ffmpeg because it was already installed, but didn't find any way to make it produce machine-readable output that included the relevant error message.
- ImageMagick identify also maybe could do it, but we don't really intentnionally have an imagemagick dependency now, it's sort of an accident transient dependency
- vips was producing error message as consequence of trying to do various things, but also didn't have a great way to get them machine-readably
- I just like how tiffdump doens't really do much, it's not puling out other expensive characterization only to throw it away
Code architecture of where to put the code
- A kithe before_promotion callback does seem the right place, and does work.
  - We DO have access to characterized mime type here, so it is possible only run for things which appear to be TIFF (based on magic byte etc)
  - executing throw :abort succeeds in cancelling shrine "promotion" -- not only will derivatives not be run, but the file wont' even be promoted to "store" storage, it really won't be ingested at all, great!
  - In addition to cancelling promotion, we would want to store the fact of validity error preventing promotion, with some error messages, somewhere.
    - shrine file metadata seemed a good place to me....
    - But I had not succeeded in actually modifying metadata from the before_promotion hook in a way that stuck! It seemed like perhaps cancelling promotion also made it impossible to persist any changes to metadata? But I didn't spend a lot of time with it, needs more investigation. If necessary and helpful, we could create an additional attr_json attribute just for ingest errors... but file metadata still seems best to me, precisely becuase it's associated with the file, and will be automatically cleared out if the file were to change. Needs more experimentation to try to get to work.

eddierubeiz commented 1 year ago

Last night we were hit by two variations on this, on two consecutive pages of Color Standards and Nomenclature:

Page 23

Link to asset for page 23
Download compressed file

Page 23 triggered the following error (Honeybadger link) when ingest ran vipsthumbnail against the file:

TTY::Command::ExitError: Running `vipsthumbnail /tmp/shrine20220929-7090-sdf5tb.tif --eprofile /app/vendor/bundle/ruby/3.0.0/gems/kithe-2.6.1/lib/vendor/icc/sRGB2014.icc --delete --size 54x65500 -o /tmp/kithe_vips_cli_image_to_jpeg20220929-7090-hwxm6m.jpg\[Q\=85,interlace,optimize_coding,strip\]` failed with
exit status: 255
stdout: Nothing written
stderr: (vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 2048
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 90
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 40
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 50
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 60
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 70
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 80
vipsthumbnail: unable to thumbnail /tmp/shrine20220929-7090-sdf5tb.tif
TIFFFillStrip: Invalid strip byte count 0, strip 2
tiff2vips: read error

$ identify color_standards_and_8mibc4w_23_kz593ep.tiff # 
color_standards_and_8mibc4w_23_kz593ep.tiff TIFF 2467x3690 2467x3690+0+0 8-bit sRGB 14.4607MiB 0.000u 0:00.003
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
identify: Sanity check on directory count failed, this is probably not a valid IFD offset. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Failed to read custom directory at offset 0. `TIFFReadCustomDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.

Page 24

Link to asset for page 24
Download compressed file

Unfortunately, the ingest process for page 24 did not trigger any errors, despite the problems identify notes below. In particular, we were able to create derivatives, including even DZI tiles. We only caught the problem through visual inspection.

$ identify color_standards_and_8mibc4w_24_fw649l6.tiff # compressed file
color_standards_and_8mibc4w_24_fw649l6.tiff[0] TIFF 2486x3684 2486x3684+0+0 8-bit sRGB 26.2882MiB 0.000u 0:00.004
color_standards_and_8mibc4w_24_fw649l6.tiff[1] TIFF 108x160 108x160+0+0 8-bit sRGB 0.000u 0:00.000
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
identify: Wrong data type 3 for "PixelXDimension"; tag ignored. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/956.
identify: Wrong data type 3 for "PixelYDimension"; tag ignored. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/956.

jrochkind commented 1 year ago

I think that page 24 is a totally legal and valid TIFF, just visually not right. Those warnings about "tag ignored" from identify I think we get for all/most of our TIFFs and aren't a signal of anything. If 24 isn't a corrupted TIFF, but is just... not a correct scan -- there might be no way for us to catch it in an automated way.

eddierubeiz commented 1 year ago

Noting that we may want to also flag as problematic, on ingest, audio files that have some combination of :

null duration
null encoding
null bitrate
null bits per sample.

Without this metadata, these audio files are typically both unplayable and impossible to stitch together into combined audio derivatives.

As of Feb. 2023, it's possible to unknowingly ingest bad audio (in any of the senses above) and not notice that anything is wrong until the combined audio derivative creation process fails two minutes after the CAD job is enqueued.

jrochkind commented 1 year ago

There might be a better way to flag bad audio captures than using missing extracted metadata as a proxy, not sure. There are of course other reasons (bugs) there could be missing extracted metadata.

Can we just actually check an audio file for validity in some reliable way, as we are trying to do with images above, instead of just noticing that our metadata extraction process seems to have failed and figuring that means invalid input?

jrochkind commented 1 year ago

Perhaps we should prevent publishing of works that have assets that fail some sanity checks -- like an image missing thumbnails.

In general, our system doesn't currently have logic easily capable of answering the question "does this have missing derivatives" in general , the info isn't really represented like that, but more like commands for what to do on ingest. Maybe.

jrochkind commented 1 year ago

exiftool may be able to catch some corruption.

Running exiftool -api validate -a on Eddie's sample file here, i get included output:

Warning                         : Missing required TIFF IFD0 tag 0x0100 ImageWidth
Warning                         : Missing required TIFF IFD0 tag 0x0101 ImageHeight
Warning                         : Missing required TIFF IFD0 tag 0x0106 PhotometricInterpretation
Warning                         : Missing required TIFF IFD0 tag 0x0111 StripOffsets
Warning                         : Missing required TIFF IFD0 tag 0x0116 RowsPerStrip
Warning                         : Missing required TIFF IFD0 tag 0x0117 StripByteCounts
Warning                         : Missing required TIFF IFD0 tag 0x011a XResolution
Warning                         : Missing required TIFF IFD0 tag 0x011b YResolution

we may be using exiftool anyway for characterization metadata. Although if you ask exiftool for -json it seems to truncate the warning output, I'm not sure how to get ALL of it in json, more experimentation is possible.

Not totally sure what -api validate is doing, can't find it in exiftool docs, got it from first comment here: https://openpreservation.org/blogs/tiff-format-validation-easy-peasy/

jrochkind commented 1 year ago

The blog post at

also mentions "DPF Manager" -- not sure how maintained/workable it is, but it seems to be intended for archival perservation and TIFFs, checking that they are good. http://www.preforma-project.eu/dpf-manager.html

jrochkind commented 11 months ago

Another spec we think we want is:

If an Asset is NOT an OH portrait or a collection thumb, do not allow it to be a JPEG (or other image format?), image asset must be a TIFF!

jrochkind commented 11 months ago

@eddierubeiz @apinkney0696 Do we have any examples of bad/corrupt audio around? I know we encountered some before. But I can't find an example.

I need an example in order to make sure the validation catches it.

If we don't have such but run into some later, we can alwasy try to add them when we run into them, still setting up some infrastructure we can re-use.

apinkney0696 commented 11 months ago

Hmm. I don't know of any off the top of my head. @sarahschneiderSHI @archivistsarah or @rachellane12 perhaps might?

eddierubeiz commented 11 months ago

For compressed audio formats (mp3 and friends), you could try just removing a few kb from the innards of a non-corrupt file, using standard file utilities. I believe the file will then be messed up in a way that our utilities should catch.

jrochkind commented 11 months ago

@eddierubeiz I can definitely create random corrupt MP3s, but they don't all generate the same kind of error from (eg) exiftool.

So I wanted some actually encountered problem ones, so I could be sure we were catching those.

I think I am remembering we have encountered actual problem files here in practice? I can't remember for sure.

sarahschneiderSHI commented 11 months ago

I don't think I've encountered any corrupt audio files yet (or at least not that I was aware of).

archivistsarah commented 11 months ago

I don't have any that have been generated in the course of regular work

jrochkind commented 11 months ago

This was technically challenging because of some earlier architectural choices, and ended up kind of convoluted, but we think it's working, for some basic cases:

When TIFF is detected as corrupt, for some actual kinds of corrupt TIFF we have encountered (can't promise it will get all corruption), #2395
When our system could not determine the file type of the file at all (probably means it's corrupt, also catches zero-byte empty files). #2403
For audio files which our system can't detect a duration, bitrate, or sample rate: #2406

When assets fail ingest, we should get alerted in #digtal-technical slack channel. There will also be a big warning showing up in admin dashboard sidebar.

However, at present nothing keeps you from publishing a work that has failed ingest attached to it... perhaps it should, maybe we need another ticket? (@apinkney0696 ). This all gets very complicated, lots of loose ends!

One thing mentioned that ended up NOT included here, because it really needs a different technical approach, about detecting JPG format when TIFF is intended, is now recorded at: #2407

apinkney0696 commented 11 months ago

Wow, lots of moving parts. Thanks for laying it out so clearly. I like the idea of adding a preventative measure to keep staff from publishing a work with a failed ingest. I will make a ticket.

Do we know if there are any corrupt files still in the repository now? I remember replacing a few that were giving us issues with OCR recently, but I can't remember if we've done a sweep of all assets.

jrochkind commented 11 months ago

@apinkney0696 Good point, I'm not 100% sure, maybe yet another ticket to do another sweep, specifically of files corrupt in the kinds of ways we guarded for in this ticket (we can only look for what we know how to find!)

rachellane12 commented 11 months ago

@jrochkind @apinkney0696 I will add from the oral history side, at least since @sarahschneiderSHI and I have been adding new interviews to the DC, we have moved the scrubber to various parts of the audio of the oral history interview to try and catch any weirdness with the audio files. That's how I caught Malcom's corrupt files a while back (which we fixed).

jrochkind commented 11 months ago

Thanks @rachellane12 I'm actually not totally sure if we know how to automatically detect that particular kind of corrupt audio file that came up in Malcom.

But if you encounter any in the future (or even if you still have the corrupt Malcom files on hand), it's good to let us know, so we can see if we can automatically detect them! We have to have the example bad file to try to figure out how to automatically detect it (in some cases we may not be able to, if the file is technically good but just has wrong audio).

rachellane12 commented 11 months ago

@jrochkind Unfortunately we don't still have the corrupt Malcom files, but I will let you know in the future of any issues and keep the bad files so you can use them as needed.

sciencehistory / scihist_digicoll

Catch corrupt original files (graphics and sound) earlier and more cleanly #1652

Page 23

Page 24