sciencehistory / scihist_digicoll

Science History Institute Digital Collections

Catch corrupt original files (graphics and sound) earlier and more cleanly #1652

Closed jrochkind closed 10 months ago

jrochkind commented 2 years ago

Sometimes our "Capture One" digitization pipeline produces TIFFs that are corrupt in some way.

They will make our ingest pipeline fail with an error such as:

We're not exactly sure what's up with these TIFFs. file and mediainfo both report a sample as image/tiff, without complaint.

However, imagemagick identify will say:

identify  ~/Desktop/soda_fountain_beverages_ncta7o9_156_4oaxl62.tiff
identify: Sanity check on directory count failed, zero tag directories not supported. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Failed to read directory at offset 41751608. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/596.

And ffprobe (which it turns out, weirdly, you can use on a TIFF), will output to stderr:

[tiff @ 0x122f043d0] IFD offset is greater than image size
[tiff_pipe @ 0x122e05a00] Could not find codec parameters for stream 0 (Video: tiff, none): unspecified size
Consider increasing the value for the 'analyzeduration' (0) and 'probesize' (5000000) options

Every time this happens, we get an error on failed ingest/derivative creation, and need to debug it a bit to figure out why it failed -- oh, corrupt TIFF.
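Both tools are complaining about the same structural problem: the first IFD (image file directory) offset in the TIFF header is bogus. As a sketch of how one might cheaply pre-check for exactly these two failure modes in plain Python, without shelling out to imagemagick or ffprobe (the function name and logic here are my own illustration, not anything in scihist_digicoll):

```python
import struct

def tiff_first_ifd_sane(data: bytes) -> bool:
    """Cheap structural pre-check for the failure modes above: a
    first-IFD offset that points past the end of the file, or an
    IFD with zero tag-directory entries."""
    if len(data) < 8:
        return False
    if data[:2] == b"II":        # little-endian TIFF
        endian = "<"
    elif data[:2] == b"MM":      # big-endian TIFF
        endian = ">"
    else:
        return False
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        return False
    # ffprobe's complaint: "IFD offset is greater than image size"
    if ifd_offset + 2 > len(data):
        return False
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    # identify's complaint: "zero tag directories not supported"
    return entry_count > 0
```

This only inspects the header and the first directory's entry count, so it is a fast smoke test, not full validation.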

eddierubeiz commented 2 years ago

Example of such an apparently corrupt tiff. (I compressed it before attaching it to this ticket because Github wouldn't allow me to attach a raw tiff file. Please decompress the tiff file before examining it.)

soda_fountain_beverages_ncta7o9_156_4oaxl62.tiff.gz

jrochkind commented 2 years ago

I'm going to move this one back to "backlog"; we have other stuff prioritized and don't really plan to do this soon.

jrochkind commented 2 years ago

@eddierubeiz When you attach those two sample files (one from a couple months ago, one from last week), they show up as .tar.gz -- am I right that you're .tar.gz'ing them just to attach them, and that the original file uploaded for ingest into our system is what I get if I un-tar-gz it, presumably a file whose suffix is .tiff?

eddierubeiz commented 2 years ago

Yes to all the above questions. I'm going to reattach the one from last week.

jrochkind commented 2 years ago

@eddierubeiz Unfortunately, when I try to open the file I downloaded above, from blasting_accessories_blas_ph2s05f_48_m1zy3an.tiff.gz, I get "The archive is empty". And indeed it's only 99 bytes big, so.

Let's see about the one from two months ago... That one does work! So I have at least one example, okeydoke.

Not sure what's going on with the difficulties attaching this latest one! Are you using an unusual method of making the .gz?

eddierubeiz commented 2 years ago

Let's try this again!

Clarification: the only reason the images above are compressed is because you can't attach a tiff to a github ticket. To view the file as I downloaded it from the digital collections, decompress it first. The file that failed derivative generation has a .tiff extension.

blasting_accessories_blas_ph2s05f_48_m1zy3an.tiff.gz

jrochkind commented 2 years ago

Finding a tool that can cheaply and accurately identify these "bad TIFFs" is harder than I anticipated... maybe we do just need to rescue the error on trying to create derivatives, but that's harder to fit into the architecture.

@apinkney0696 if you have a second, I'm curious what you know about the nature of these "bad" TIFFs created by "Capture One". Do you know anything about what causes them and what they are like?

eddierubeiz commented 2 years ago

We established that it's a Capture One problem; Annabel has already contacted Capture One about it. The outcome was that Annabel can avoid the problem by exporting to a local drive rather than to a network drive.

jrochkind commented 2 years ago

Thanks @eddierubeiz . I'm curious if we know anything more than that about the nature of the corruption, what Capture One is doing. Because it might help me figure out how to identify the corrupt files. But I understand we may not.

jrochkind commented 2 years ago

I'm probably going to take a break from this, so documenting where I got:

eddierubeiz commented 1 year ago

Last night we were hit by two variations on this, on two consecutive pages of Color Standards and Nomenclature:


Page 23

$ identify color_standards_and_8mibc4w_23_kz593ep.tiff # 
color_standards_and_8mibc4w_23_kz593ep.tiff TIFF 2467x3690 2467x3690+0+0 8-bit sRGB 14.4607MiB 0.000u 0:00.003
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.
identify: Sanity check on directory count failed, this is probably not a valid IFD offset. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Failed to read custom directory at offset 0. `TIFFReadCustomDirectory' @ error/tiff.c/TIFFErrors/596.
identify: Incorrect value for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/956.

Page 24

jrochkind commented 1 year ago

I think that page 24 is a totally legal and valid TIFF, just visually not right. I think we get those "tag ignored" warnings from identify for all or most of our TIFFs; they aren't a signal of anything. If 24 isn't a corrupted TIFF, but is just... not a correct scan -- there might be no way for us to catch it in an automated way.

eddierubeiz commented 1 year ago

Noting that we may want to also flag as problematic, on ingest, audio files that have some combination of:

Without this metadata, these audio files are typically both unplayable and impossible to stitch together into combined audio derivatives.

As of Feb. 2023, it's possible to unknowingly ingest bad audio (in any of the senses above) and not notice that anything is wrong until the combined audio derivative creation process fails two minutes after the CAD job is enqueued.
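A sketch of the guard this implies: before enqueueing the combined-audio-derivative job, check that every required extracted-metadata field is actually present, and flag the asset if not. The metadata key names below are hypothetical placeholders, not the app's real field names.

```python
# Hypothetical names for the extracted-metadata fields an audio
# asset needs before combined-derivative creation can succeed.
REQUIRED_AUDIO_METADATA = ("duration_seconds", "bitrate", "audio_codec")

def missing_audio_metadata(metadata: dict) -> list:
    """Return the required keys that are absent or empty, so ingest
    can flag the asset instead of letting the CAD job fail later."""
    return [key for key in REQUIRED_AUDIO_METADATA if not metadata.get(key)]
```

An asset with a non-empty return value here would be flagged at ingest time, rather than two minutes after the CAD job is enqueued.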

jrochkind commented 1 year ago

There might be a better way to flag bad audio captures than using missing extracted metadata as a proxy, not sure. There are of course other reasons (bugs) there could be missing extracted metadata.

Can we just actually check an audio file for validity in some reliable way, as we are trying to do with images above, instead of just noticing that our metadata extraction process seems to have failed and figuring that means invalid input?

jrochkind commented 1 year ago

Perhaps we should prevent publishing of works that have assets that fail some sanity checks -- like an image missing thumbnails.

In general, our system doesn't currently have logic easily capable of answering the question "does this have missing derivatives"; the info isn't really represented like that, but more as commands for what to do on ingest. Maybe.
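As a sketch of what such a publish check could look like, assuming we had a per-asset record of which derivatives exist (the class and derivative names here are hypothetical; as noted, the real app represents this differently):

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an asset record; the real app represents
# derivatives more as commands for what to do on ingest.
@dataclass
class AssetRecord:
    content_type: str
    derivative_keys: set = field(default_factory=set)

# Hypothetical derivative names, for illustration only.
REQUIRED_IMAGE_DERIVATIVES = {"thumb_small", "thumb_large"}

def publish_blockers(assets):
    """Return the image assets missing any required derivative,
    which would block publishing the work they belong to."""
    return [a for a in assets
            if a.content_type.startswith("image/")
            and not REQUIRED_IMAGE_DERIVATIVES <= a.derivative_keys]
```

Publishing would be allowed only when `publish_blockers` returns an empty list for the work's assets.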

jrochkind commented 1 year ago

exiftool may be able to catch some corruption.

Running exiftool -api validate -a on Eddie's sample file here, I get output that includes:

Warning                         : Missing required TIFF IFD0 tag 0x0100 ImageWidth
Warning                         : Missing required TIFF IFD0 tag 0x0101 ImageHeight
Warning                         : Missing required TIFF IFD0 tag 0x0106 PhotometricInterpretation
Warning                         : Missing required TIFF IFD0 tag 0x0111 StripOffsets
Warning                         : Missing required TIFF IFD0 tag 0x0116 RowsPerStrip
Warning                         : Missing required TIFF IFD0 tag 0x0117 StripByteCounts
Warning                         : Missing required TIFF IFD0 tag 0x011a XResolution
Warning                         : Missing required TIFF IFD0 tag 0x011b YResolution

We may be using exiftool anyway for characterization metadata. Although if you ask exiftool for -json output, it seems to truncate the warnings; I'm not sure how to get ALL of them in JSON. More experimentation is possible.

Not totally sure what -api validate is doing; I can't find it in the exiftool docs. I got it from the first comment here: https://openpreservation.org/blogs/tiff-format-validation-easy-peasy/
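Since -json truncates the warnings, one workaround is to run exiftool in plain-text mode and collect the warning lines ourselves. A minimal parser sketch, assuming the text output looks like the sample above (function and pattern names are my own):

```python
import re

# Matches lines like:
#   Warning                         : Missing required TIFF IFD0 tag 0x0100 ImageWidth
WARNING_LINE = re.compile(r"^Warning\s*:\s*(.+)$")

def validation_warnings(exiftool_text_output: str) -> list:
    """Collect warning messages from `exiftool -api validate -a`
    plain-text output."""
    return [m.group(1).strip()
            for line in exiftool_text_output.splitlines()
            if (m := WARNING_LINE.match(line.strip()))]
```

A file whose warnings include "Missing required TIFF IFD0 tag" lines like the ones above would then be flagged as suspect.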

jrochkind commented 1 year ago

The blog post linked above also mentions "DPF Manager" -- not sure how maintained/workable it is, but it seems to be intended for archival preservation of TIFFs, checking that they are good. http://www.preforma-project.eu/dpf-manager.html

jrochkind commented 11 months ago

Another spec we think we want is:

If an Asset is NOT an OH portrait or a collection thumb, do not allow it to be a JPEG (or other image format?); an image asset must be a TIFF!

jrochkind commented 11 months ago

@eddierubeiz @apinkney0696 Do we have any examples of bad/corrupt audio around? I know we encountered some before. But I can't find an example.

I need an example in order to make sure the validation catches it.

If we don't have any but run into some later, we can always try to add them then, while still setting up some infrastructure we can re-use.

apinkney0696 commented 11 months ago

Hmm. I don't know of any off the top of my head. @sarahschneiderSHI @archivistsarah or @rachellane12 perhaps might?

eddierubeiz commented 11 months ago

For compressed audio formats (mp3 and friends), you could try just removing a few kb from the innards of a non-corrupt file, using standard file utilities. I believe the file will then be messed up in a way that our utilities should catch.
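Something like the following would do what's described above (the offsets are arbitrary; this is just a sketch for generating test fixtures, not anything in the app):

```python
def corrupt_by_excision(data: bytes, start: int = 1024, length: int = 4096) -> bytes:
    """Remove `length` bytes from the middle of a compressed audio
    file's content, simulating the kind of corruption described above."""
    if start + length > len(data):
        raise ValueError("file too small to excise that span")
    return data[:start] + data[start + length:]
```

For an mp3, deleting a span from the middle like this damages the frame stream, which decoding tools should then complain about.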

jrochkind commented 11 months ago

@eddierubeiz I can definitely create random corrupt MP3s, but they don't all generate the same kind of error from (eg) exiftool.

So I wanted some actually encountered problem ones, so I could be sure we were catching those.

I think I am remembering we have encountered actual problem files here in practice? I can't remember for sure.

sarahschneiderSHI commented 11 months ago

I don't think I've encountered any corrupt audio files yet (or at least not that I was aware of).

archivistsarah commented 11 months ago

I don't have any that have been generated in the course of regular work

jrochkind commented 11 months ago

This was technically challenging because of some earlier architectural choices, and ended up kind of convoluted, but we think it's working, for some basic cases:

When assets fail ingest, we should get alerted in the #digital-technical slack channel. There will also be a big warning showing up in the admin dashboard sidebar.

However, at present nothing keeps you from publishing a work that has a failed-ingest asset attached to it... perhaps it should; maybe we need another ticket? (@apinkney0696) This all gets very complicated, lots of loose ends!

One thing mentioned that ended up NOT included here, because it really needs a different technical approach, about detecting JPG format when TIFF is intended, is now recorded at: #2407

apinkney0696 commented 11 months ago

Wow, lots of moving parts. Thanks for laying it out so clearly. I like the idea of adding a preventative measure to keep staff from publishing a work with a failed ingest. I will make a ticket.

Do we know if there are any corrupt files still in the repository now? I remember replacing a few that were giving us issues with OCR recently, but I can't remember if we've done a sweep of all assets.

jrochkind commented 11 months ago

@apinkney0696 Good point, I'm not 100% sure. Maybe yet another ticket to do another sweep, specifically for files corrupt in the kinds of ways we guarded against in this ticket (we can only look for what we know how to find!)

rachellane12 commented 11 months ago

@jrochkind @apinkney0696 I will add from the oral history side, at least since @sarahschneiderSHI and I have been adding new interviews to the DC, we have moved the scrubber to various parts of the audio of the oral history interview to try and catch any weirdness with the audio files. That's how I caught Malcom's corrupt files a while back (which we fixed).

jrochkind commented 11 months ago

Thanks @rachellane12 I'm actually not totally sure if we know how to automatically detect that particular kind of corrupt audio file that came up in Malcom.

But if you encounter any in the future (or even if you still have the corrupt Malcom files on hand), it's good to let us know, so we can see if we can automatically detect them! We have to have the example bad file to try to figure out how to automatically detect it (in some cases we may not be able to, if the file is technically good but just has wrong audio).

rachellane12 commented 11 months ago

@jrochkind Unfortunately we don't still have the corrupt Malcom files, but I will let you know in the future of any issues and keep the bad files so you can use them as needed.