Closed: jrochkind closed this 10 months ago
Example of such an apparently corrupt tiff. (I compressed it before attaching it to this ticket because Github wouldn't allow me to attach a raw tiff file. Please decompress the file before examining it.)
i'm gonna move this one back to "backlog", we have other stuff prioritized and don't really plan to do this soon
@eddierubeiz When you attach those two sample files (one from a couple months ago, one from last week), they show up as .tar.gz -- am I right that you are .tar.gz'ing them just to attach them, and the original file that was uploaded to try to ingest in our system is the file I get if I un-tar-gz it, presumably a file whose suffix is .tiff?
Yes to all the above questions. I'm going to reattach the one from last week.
@eddierubeiz Unfortunately, when I try to open the file I downloaded above, from blasting_accessories_blas_ph2s05f_48_m1zy3an.tiff.gz, I get "The archive is empty". And indeed it's only 99 bytes big, so.
Let's see about the one from two months ago... That one does work! So I have at least one example, okeydoke.
Not sure what's going on with the difficulties attaching this latest one! Are you using an unusual method of making the .gz?
Let's try this again!
Clarification: the only reason the images above are compressed is because you can't attach a tiff to a github ticket. To view the file as I downloaded it from the digital collections, decompress it first. The file that failed derivative generation has a .tiff extension.
Finding a tool that can cheaply and accurately identify these "bad TIFFs" is harder than I anticipated... maybe we do just need to rescue the error on trying to create derivatives, but that's harder to fit into the architecture.
@apinkney0696 if you have a second, I'm curious what you know about the nature of these "bad" TIFFs created by "Capture One". Do you know anything about what causes them and what they are like?
We established that it's a Capture One problem; Annabel has already contacted Capture One about it. The outcome was that Annabel can avoid the problem by exporting to a local drive rather than to a network drive.
Thanks @eddierubeiz . I'm curious if we know anything more than that about the nature of the corruption, what Capture One is doing. Because it might help me figure out how to identify the corrupt files. But I understand we may not.
I'm probably going to take a break from this, so documenting where I got:

How to identify invalid tiffs?
[...] No space for TIFF directory, along with a non-0 exit code.
[...] libtiff-tools. [...] libtiff package, which was already a dependency of vips etc. (brew has the tiffdump command line in the libtiff package itself, while apt segregates it in a separate libtiff-tools.)
identify also maybe could do it, but we don't really intentionally have an imagemagick dependency now, it's sort of an accidental transitive dependency.

Code architecture of where to put the code
throw :abort succeeds in cancelling shrine "promotion" -- not only will derivatives not be run, but the file won't even be promoted to "store" storage, it really won't be ingested at all, great!
[...] metadata seemed a good place to me.... [...] before_promotion hook in a way that stuck! It seemed like perhaps cancelling promotion also made it impossible to persist any changes to metadata? But I didn't spend a lot of time with it, needs more investigation. If necessary and helpful, we could create an additional attr_json attribute just for ingest errors... but file metadata still seems best to me, precisely because it's associated with the file, and will be automatically cleared out if the file were to change. Needs more experimentation to try to get to work.

Last night we were hit by two variations on this, on two consecutive pages of Color Standards and Nomenclature:
[...] vipsthumbnail against the file:
TTY::Command::ExitError: Running `vipsthumbnail /tmp/shrine20220929-7090-sdf5tb.tif --eprofile /app/vendor/bundle/ruby/3.0.0/gems/kithe-2.6.1/lib/vendor/icc/sRGB2014.icc --delete --size 54x65500 -o /tmp/kithe_vips_cli_image_to_jpeg20220929-7090-hwxm6m.jpg\[Q\=85,interlace,optimize_coding,strip\]` failed with
exit status: 255
stdout: Nothing written
stderr: (vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 2048
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 90
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 40
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 50
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 60
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 70
(vipsthumbnail:7118): VIPS-WARNING **: 17:02:47.025: error in tile 0 x 80
vipsthumbnail: unable to thumbnail /tmp/shrine20220929-7090-sdf5tb.tif
TIFFFillStrip: Invalid strip byte count 0, strip 2
tiff2vips: read error
$ identify color_standards_and_8mibc4w_23_kz593ep.tiff #
color_standards_and_8mibc4w_23_kz593ep.tiff TIFF 2467x3690 2467x3690+0+0 8-bit sRGB 14.4607MiB 0.000u 0:00.003
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
identify: Sanity check on directory count failed, this is probably not a valid IFD offset. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Failed to read custom directory at offset 0. `TIFFReadCustomDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
[...] identify notes below.
In particular, we were able to create derivatives, including even DZI tiles. We only caught the problem through visual inspection.
$ identify color_standards_and_8mibc4w_24_fw649l6.tiff # compressed file
color_standards_and_8mibc4w_24_fw649l6.tiff[0] TIFF 2486x3684 2486x3684+0+0 8-bit sRGB 26.2882MiB 0.000u 0:00.004
color_standards_and_8mibc4w_24_fw649l6.tiff[1] TIFF 108x160 108x160+0+0 8-bit sRGB 0.000u 0:00.000
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
identify: Wrong data type 3 for "PixelXDimension"; tag ignored. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/956.
identify: Wrong data type 3 for "PixelYDimension"; tag ignored. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/956.
I think that page 24 is a totally legal and valid TIFF, just visually not right. Those warnings about "tag ignored" from identify I think we get for all/most of our TIFFs and aren't a signal of anything. If 24 isn't a corrupted TIFF, but is just... not a correct scan -- there might be no way for us to catch it in an automated way.
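To make the page 23 vs. page 24 difference concrete, here is a minimal sketch that splits identify's stderr into error-level and warning-level lines. It assumes the severity is the word right after "@ " in each message (as in the output pasted above); the method and constant names are made up for illustration, and I haven't confirmed that every corrupt TIFF produces an error-level line.

```ruby
# Split ImageMagick `identify` stderr into error-level and warning-level lines.
# Assumption: severity is the path segment right after "@ ", as in the samples
# quoted in this thread. Names here are illustrative, not existing app code.
IDENTIFY_SEVERITY = %r{@ (error|warning)/}

def classify_identify_stderr(stderr)
  result = { errors: [], warnings: [] }
  stderr.each_line do |line|
    match = IDENTIFY_SEVERITY.match(line)
    next unless match # skip lines that carry no severity marker

    key = match[1] == "error" ? :errors : :warnings
    result[key] << line.strip
  end
  result
end

# Two of the lines identify printed for page 23, copied from above.
page_23 = <<~'LOG'
  identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
  identify: Sanity check on directory count failed, this is probably not a valid IFD offset. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
LOG

classified = classify_identify_stderr(page_23)
puts "errors: #{classified[:errors].length}, warnings: #{classified[:warnings].length}"
```

With a filter like this, page 24 would come back with warnings only, which matches the point that it can't be caught this way.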
Noting that we may want to also flag as problematic, on ingest, audio files that have some combination of :
Without this metadata, these audio files are typically both unplayable and impossible to stitch together into combined audio derivatives.
As of Feb. 2023, it's possible to unknowingly ingest bad audio (in any of the senses above) and not notice that anything is wrong until the combined audio derivative creation process fails two minutes after the CAD job is enqueued.
There might be a better way to flag bad audio captures than using missing extracted metadata as a proxy, not sure. There are of course other reasons (bugs) there could be missing extracted metadata.
Can we just actually check an audio file for validity in some reliable way, as we are trying to do with images above, instead of just noticing that our metadata extraction process seems to have failed and figuring that means invalid input?
Perhaps we should prevent publishing of works that have assets that fail some sanity checks -- like an image missing thumbnails.
In general, our system doesn't currently have logic easily capable of answering the question "does this have missing derivatives"; the info isn't really represented like that, but more like commands for what to do on ingest. Maybe.
exiftool may be able to catch some corruption.
Running exiftool -api validate -a on Eddie's sample file here, I get output that includes:
Warning : Missing required TIFF IFD0 tag 0x0100 ImageWidth
Warning : Missing required TIFF IFD0 tag 0x0101 ImageHeight
Warning : Missing required TIFF IFD0 tag 0x0106 PhotometricInterpretation
Warning : Missing required TIFF IFD0 tag 0x0111 StripOffsets
Warning : Missing required TIFF IFD0 tag 0x0116 RowsPerStrip
Warning : Missing required TIFF IFD0 tag 0x0117 StripByteCounts
Warning : Missing required TIFF IFD0 tag 0x011a XResolution
Warning : Missing required TIFF IFD0 tag 0x011b YResolution
We may be using exiftool anyway for characterization metadata. Although if you ask exiftool for -json it seems to truncate the warning output; I'm not sure how to get ALL of it in json, more experimentation is possible.
Not totally sure what -api validate is doing, can't find it in exiftool docs, got it from first comment here: https://openpreservation.org/blogs/tiff-format-validation-easy-peasy/
The blog post also mentions "DPF Manager" -- not sure how maintained/workable it is, but it seems to be intended for archival preservation and TIFFs, checking that they are good. http://www.preforma-project.eu/dpf-manager.html
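Going back to the exiftool warnings pasted above: since the -json output seemed to truncate them, one option is to parse the plain-text lines. A rough sketch, assuming the "Warning : Missing required TIFF IFD0 tag ..." line format shown in this comment stays stable; the method name is hypothetical and this only parses text, it doesn't run exiftool itself.

```ruby
# Pull "missing required tag" entries out of exiftool's plain-text validation
# output. Assumption: lines look like the ones quoted in this thread; exiftool
# pads the label with spaces, so whitespace around ":" is matched loosely.
MISSING_TAG = /^Warning\s*:\s*Missing required TIFF IFD0 tag (0x[0-9a-fA-F]{4}) (\w+)/

def missing_required_tags(exiftool_output)
  exiftool_output.each_line.filter_map do |line|
    m = MISSING_TAG.match(line)
    m && { id: m[1], name: m[2] }
  end
end

# Two of the warnings from the sample file above.
sample = <<~OUT
  Warning : Missing required TIFF IFD0 tag 0x0100 ImageWidth
  Warning : Missing required TIFF IFD0 tag 0x0111 StripOffsets
OUT

missing = missing_required_tags(sample)
puts missing.map { |tag| tag[:name] }.join(", ")
```

A file missing ImageWidth or StripOffsets can't really be a usable image, so a non-empty list here might be a reasonable signal to stop ingest, but that's a judgment call we'd want to check against known-good TIFFs first.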
Another spec we think we want is:
If an Asset is NOT an OH portrait or a collection thumb, do not allow it to be a JPEG (or other image format?), image asset must be a TIFF!
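A minimal sketch of what that check could look like, going by file signature rather than filename suffix. The TIFF and JPEG magic bytes are standard; everything else (the role names, the method names, where this would be called from) is hypothetical and not how our code is actually organized.

```ruby
# Detect TIFF vs JPEG from the first bytes of a file instead of its suffix.
# Classic TIFF only: BigTIFF uses a different version number and is not handled.
TIFF_MAGIC = ["II*\x00".b, "MM\x00*".b].freeze # little-endian, big-endian
JPEG_MAGIC = "\xFF\xD8\xFF".b

def sniff_image_format(path)
  header = File.binread(path, 4) || "".b # nil for an empty file
  return :tiff if TIFF_MAGIC.include?(header)
  return :jpeg if header.start_with?(JPEG_MAGIC)

  :unknown
end

# Hypothetical role names standing in for "OH portrait" and "collection thumb".
EXEMPT_ROLES = %w[portrait collection_thumbnail].freeze

# True when the file may be ingested as an image asset under the proposed rule.
def acceptable_image?(path, role: nil)
  return true if EXEMPT_ROLES.include?(role)

  sniff_image_format(path) == :tiff
end
```

Usage would be something like `acceptable_image?(uploaded_path, role: asset_role)`, with the caller deciding what to do on false.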
@eddierubeiz @apinkney0696 Do we have any examples of bad/corrupt audio around? I know we encountered some before. But I can't find an example.
I need an example in order to make sure the validation catches it.
If we don't have such but run into some later, we can always try to add them when we run into them, still setting up some infrastructure we can re-use.
Hmm. I don't know of any off the top of my head. @sarahschneiderSHI @archivistsarah or @rachellane12 perhaps might?
For compressed audio formats (mp3 and friends), you could try just removing a few kb from the innards of a non-corrupt file, using standard file utilities. I believe the file will then be messed up in a way that our utilities should catch.
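For what it's worth, the "remove a few kb from the innards" idea can be scripted so test fixtures are reproducible. This is only a sketch: the method name and the default cut size are arbitrary choices, and there is no guarantee a file damaged this way fails in the same way a real bad capture does.

```ruby
# Build a deliberately damaged copy of a known-good file by cutting a chunk
# out of its middle. Name and default size are illustrative only.
def write_damaged_copy(source_path, dest_path, cut_bytes: 4096)
  data = File.binread(source_path)
  if data.bytesize <= cut_bytes * 2
    raise ArgumentError, "file too small to cut #{cut_bytes} bytes from the middle"
  end

  start = (data.bytesize - cut_bytes) / 2 # centre the cut so header and tail survive
  damaged = data.byteslice(0, start) + data.byteslice(start + cut_bytes, data.bytesize)
  File.binwrite(dest_path, damaged)
  dest_path
end
```

Something like `write_damaged_copy("good.mp3", "damaged.mp3", cut_bytes: 8192)` would then give a fixture to run the validation against.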
@eddierubeiz I can definitely create random corrupt MP3s, but they don't all generate the same kind of error from (eg) exiftool.
So I wanted some actually encountered problem ones, so I could be sure we were catching those.
I think I am remembering we have encountered actual problem files here in practice? I can't remember for sure.
I don't think I've encountered any corrupt audio files yet (or at least not that I was aware of).
I don't have any that have been generated in the course of regular work
This was technically challenging because of some earlier architectural choices, and ended up kind of convoluted, but we think it's working, for some basic cases:
When assets fail ingest, we should get alerted in #digtal-technical slack channel. There will also be a big warning showing up in admin dashboard sidebar.
However, at present nothing keeps you from publishing a work that has failed ingest attached to it... perhaps it should, maybe we need another ticket? (@apinkney0696 ). This all gets very complicated, lots of loose ends!
One thing mentioned that ended up NOT included here, because it really needs a different technical approach, about detecting JPG format when TIFF is intended, is now recorded at: #2407
Wow, lots of moving parts. Thanks for laying it out so clearly. I like the idea of adding a preventative measure to keep staff from publishing a work with a failed ingest. I will make a ticket.
Do we know if there are any corrupt files still in the repository now? I remember replacing a few that were giving us issues with OCR recently, but I can't remember if we've done a sweep of all assets.
@apinkney0696 Good point, I'm not 100% sure, maybe yet another ticket to do another sweep, specifically of files corrupt in the kinds of ways we guarded for in this ticket (we can only look for what we know how to find!)
@jrochkind @apinkney0696 I will add from the oral history side, at least since @sarahschneiderSHI and I have been adding new interviews to the DC, we have moved the scrubber to various parts of the audio of the oral history interview to try and catch any weirdness with the audio files. That's how I caught Malcom's corrupt files a while back (which we fixed).
Thanks @rachellane12 I'm actually not totally sure if we know how to automatically detect that particular kind of corrupt audio file that came up in Malcom.
But if you encounter any in the future (or even if you still have the corrupt Malcom files on hand), it's good to let us know, so we can see if we can automatically detect them! We have to have the example bad file to try to figure out how to automatically detect it (in some cases we may not be able to, if the file is technically good but just has wrong audio).
@jrochkind Unfortunately we don't still have the corrupt Malcom files, but I will let you know in the future of any issues and keep the bad files so you can use them as needed.
Sometimes our "Capture One" digitization pipeline produces TIFFs that are corrupt in some way.
They will make our ingest pipeline fail with an error such as:
vips dzsave, magick2vips: libMagick error: Sanity check on directory count failed, zero tag directories not supported. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/604. (https://app.honeybadger.io/projects/58989/faults/84983034)
vipsthumbnail: [...]

We're not exactly sure what's up with these TIFFs.
file and mediainfo both report a sample as image/tiff, without complaint.
However, imagemagick identify will say: [...]
And ffprobe (which it turns out, weirdly, you can use on a TIFF), will output to stderr: [...]

Every time this happens, we get an error on failed ingest/derivative creation, and need to debug it a bit to figure out why it failed -- oh, corrupt TIFF.
[...] identify or ffprobe on it in the early ingest pipeline, and abort ingest intentionally if that reveals it as this weirdness?
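A minimal sketch of that idea, assuming we shell out to whichever checker we settle on and treat either a non-zero exit or a known error message on stderr as "corrupt". The CheckResult struct, the check_file name, and the default error pattern (taken from the "Sanity check on directory count failed" message quoted above) are all assumptions for illustration; this doesn't touch shrine at all, the caller would still have to decide how to abort.

```ruby
require "open3"

# Outcome of running an external checker. Illustrative, not existing app code.
CheckResult = Struct.new(:ok, :exit_status, :stderr, keyword_init: true)

# Run an external checker command against a file.
# `command` is an array such as ["identify"] or ["ffprobe"]; the file path is
# appended as the last argument. The result is not ok when the command exits
# non-zero OR when stderr matches `error_pattern` (an assumed pattern based on
# the message quoted in this ticket).
def check_file(command, path, error_pattern: /Sanity check on directory count failed/)
  _stdout, stderr, status = Open3.capture3(*command, path)
  ok = status.success? && !stderr.match?(error_pattern)
  CheckResult.new(ok: ok, exit_status: status.exitstatus, stderr: stderr)
end

# Example call (not executed here, since it needs imagemagick installed):
#   result = check_file(["identify"], "/path/to/upload.tiff")
#   abort_ingest unless result.ok   # `abort_ingest` is a placeholder
```

Passing the command as an array avoids shell quoting problems with odd filenames, and checking stderr as well as the exit status covers tools that complain but still exit 0.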