sciencehistory / scihist_digicoll

Science History Institute Digital Collections

Characterization bug: certain mp3 files are not recognized as having audio/mpeg content. #328

Closed eddierubeiz closed 5 years ago

eddierubeiz commented 5 years ago

Take a look at the two following assets:

https://kithe.sciencehistory.org/admin/asset_files/rujbxiy https://kithe.sciencehistory.org/admin/asset_files/c676fil

Downloading them, then running the file command on them yields:

fenselau_c_0710_1-2.mp3: MPEG ADTS, layer III, v2, 128 kbps, 22.05 kHz, Monaural
fenselau_c_0710_1-3.mp3: MPEG ADTS, layer III, v2, 128 kbps, 22.05 kHz, Monaural

The resulting content type is application/octet-stream and audio derivatives are never generated.

Compare e.g. https://kithe.sciencehistory.org/admin/asset_files/x633f206c

mark_h_0030_1-1.mp3: Audio file with ID3 version 2.4.0, contains:MPEG ADTS, layer III, v1, 64 kbps, 48 kHz, Monaural

which is correctly characterized as audio/mpeg.

jrochkind commented 5 years ago

Hmm. You're saying our file characterization is improperly characterizing them, our app doesn't know they are mp3, so they don't get derivatives? That is annoying. (Is this happening to most/all of our audio, or just certain things?)

Let me sanity check...

I download: fenselau_c_0710_1-2.mp3

Our file characterization, under the hood, uses the marcel tool to determine content-type. Let me try that manually on that file.

Marcel::MimeType.for Pathname.new("/Users/jrochkind/Downloads/fenselau_c_0710_1-2.mp3")

Indeed returns application/octet-stream. doh!

Hmm, file says it's "MPEG ADTS" -- I was wondering if it wasn't really an mp3, but that is the same thing file says for things we believe are mp3. And marcel can correctly identify our ice_cubes.mp3 as mp3.

There is no newer version of marcel than the one we have.

I think we should report this as a bug to marcel. But that requires us to give them a sample file somehow -- do you want to try creating a much shorter excerpt of this that still reproduces? I wonder if any monaural mp3 would be missed by marcel.

But for now, we are going to need to investigate other content-type sniffers I think -- OR configure it to trust the file ending mp3 if it can't figure it out otherwise. Hmm.

eddierubeiz commented 5 years ago

Yeah -- I would actually guess it's either the missing ID3 version 2.4.0 thing, or the layer III, v1 vs. layer III, v2 thing. From what I know about id3 tags, they're not integral to mp3, so probably the latter. Avenues for further research.

mark_h_0030_1-1.mp3 is mono, so that's probably not the problem.
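One way to check which of those two differences a given file has is to look at its first bytes directly. The sketch below is only an illustration: the helper name `describe_mp3_start` and its return values are made up for this thread, and it says nothing about why marcel fails. It relies on two facts about the formats: an ID3v2 tag starts with the ASCII bytes "ID3", and a bare MPEG audio frame starts with 11 set sync bits followed by a 2-bit version field and a 2-bit layer field.

```ruby
# Sketch only: classify the first bytes of a candidate MP3 file.
# The method name and return hash are hypothetical, not part of any library.

MPEG_VERSIONS = { 0b00 => "2.5", 0b10 => "2", 0b11 => "1" }.freeze # 0b01 is reserved
MPEG_LAYERS   = { 0b01 => "III", 0b10 => "II", 0b11 => "I" }.freeze # 0b00 is reserved

def describe_mp3_start(bytes)
  bytes = bytes.b # work on a binary copy
  return { container: :id3v2 } if bytes.start_with?("ID3".b)

  b0 = bytes.getbyte(0)
  b1 = bytes.getbyte(1)
  return { container: :unknown } if b0.nil? || b1.nil?

  # Frame sync: all 8 bits of the first byte and the top 3 bits of the second.
  return { container: :unknown } unless b0 == 0xFF && (b1 & 0xE0) == 0xE0

  version = MPEG_VERSIONS[(b1 >> 3) & 0b11]
  layer   = MPEG_LAYERS[(b1 >> 1) & 0b11]
  return { container: :unknown } if version.nil? || layer.nil?

  { container: :raw_frames, version: version, layer: layer }
end
```

Running it as `describe_mp3_start(File.binread(path, 10))` on one of the fenselau files and on mark_h_0030_1-1.mp3 should show whether the first starts with bare frames and the second with an ID3 tag, assuming file's "v1"/"v2" refers to the MPEG version field.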

jrochkind commented 5 years ago

Technically ADTS is not really the same as mp3 though -- as far as I can tell. I wonder which one we really have! Media file types are so confusing!

eddierubeiz commented 5 years ago

Yeah. Some interesting discussion of this at https://unix.stackexchange.com/questions/177236/lossless-conversion-of-mpeg-adts-mp3-to-normal-mp3

jrochkind commented 5 years ago

An actual sample mp3 file I downloaded from the web gets identified by file as "Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 224 kbps, 32 kHz, Stereo".

(I don't know why the ice_cubes one you found also gets identified as ADTS).

One sanity check, with @sanfordd -- is that fenselau_c_0710_1-2.mp3 file an "original" from OH? It didn't go through any of our workflow for transcoding audio? If it didn't go through any of our workflow, it means it's not a bug in our workflow -- but still, I wonder is it really an MP3 or an ADTS? if the latter (and they are different?) should we tell OH that some of their mp3 files aren't really mp3s?

How would we figure out if it's really ADTS or MP3? Other than trusting file on MacOS? Or are those actually enough the same thing that nobody distinguishes?

eddierubeiz commented 5 years ago

Technically they all contain ADTS streams -- it's the "contains" part that's missing from the fenselau files. I wonder whether we could just add an id3 tag and this problem would go away.

jrochkind commented 5 years ago

We probably want to tell OH that their MP3 originals are actually ADTS, if that's what's going on.

Some of the answers on the StackExchange you linked to suggest something different than "Technically they all contain ADTS streams" to me though. The top answer says: "There is no way to convert MPEG ADTS to MP3 without decoding and reencoding them. They are fundamentally different formats/encodings."

jrochkind commented 5 years ago

Ruby MimeMagic correctly determines audio/mpeg. (It may be that ADTS and MP3 are different, but they are both audio/mpeg?).

We can easily change from using marcel to using mimemagic, and should probably do so then.

jrochkind commented 5 years ago

Alternately, we could keep using marcel but fall back to mimemagic when marcel can't find anything. Not sure of the pros and cons. Probably just switching to mimemagic is fine.

eddierubeiz commented 5 years ago

id3tag -aartist fenselau_c_0710_1-3.mp3 fixes the problem. See https://kithe.sciencehistory.org/admin/asset_files/tfnxcyp .

jrochkind commented 5 years ago

Very good to know. But if that's an original from OH, we probably don't want to mess with it if we don't have to -- and it looks like we don't have to, there's another reasonable ruby magic-byte analyzer that figures it out successfully. (And falling back to trusting mp3 would be another option).

jrochkind commented 5 years ago

I am still very curious why marcel fails here and mimemagic succeeds -- marcel looks like it starts with mimemagic, and just adds some more types on top, it doesn't look like it should be possible for it to fail where mimemagic succeeds.

I might spend some more time debugging marcel, perhaps to file a ticket with marcel.

sanfordd commented 5 years ago

@jrochkind In our application an MP3 file was used as an original, no transcoding was done. I did a quick check of the original file with the file command and it's reading the mimetype as audio/mpeg

jrochkind commented 5 years ago

Uh oh, investigated further, I was wrong, mimemagic can't correctly identify the file either.

Apparently mimemagic's magic byte database isn't as good as the one on MacOS and ubuntu, somehow.

  1. We could resort to trusting the file suffix if we can't figure out the content type from byte analysis. (All our content is uploaded by staff, so nobody should be trying to maliciously spoof us)

  2. We could resort to the command-line file utility run on the server, when the ruby analysis can't figure it out. But since different hosts will have different versions of file with different dbs, this means there could be edge cases where different determinations are made and it's hard to debug (on dev vs prod; when we upgrade the OS it changes; etc).

  3. We could manipulate the files as you suggest to 'trick' the magic byte detector. This has potential archival considerations, since these are 'originals', I think.
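Option 1 could look roughly like the sketch below. Everything in it is an assumption for illustration: the method name, the tiny extension table, and the idea that the byte sniffer reports "application/octet-stream" (or nil) when it gives up. It is not how our app does it today.

```ruby
# Sketch of option 1: trust the file suffix only when byte sniffing gives up.
# EXTENSION_TYPES is a hypothetical, deliberately tiny table.

EXTENSION_TYPES = { ".mp3" => "audio/mpeg" }.freeze
GENERIC_TYPE    = "application/octet-stream"

def content_type_with_suffix_fallback(sniffed_type, filename)
  # A specific sniffed type always wins over the suffix.
  return sniffed_type unless sniffed_type.nil? || sniffed_type == GENERIC_TYPE

  EXTENSION_TYPES.fetch(File.extname(filename).downcase, GENERIC_TYPE)
end
```

With this, `content_type_with_suffix_fallback("application/octet-stream", "fenselau_c_0710_1-2.mp3")` would give audio/mpeg, while a file that sniffs as something specific keeps that type regardless of its suffix.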

Hmm.

sanfordd commented 5 years ago

The biggest consideration is that any id3 tag edits will change the checksum of the file if we do want to compare it to the copy Oral Histories has.
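A small illustration of the checksum point, using made-up stand-in bytes rather than a real file. The 10-byte header below is just an empty ID3v2-style header for demonstration, and SHA-256 is an arbitrary choice of digest, not necessarily what we or Oral Histories actually record.

```ruby
require "digest"

# Stand-in for the untouched original: a frame-sync-like prefix plus padding.
original = "\xFF\xF3\x90\x00".b + ("\x00".b * 32)

# Stand-in for the same audio with a minimal 10-byte tag header prepended.
tagged = "ID3\x04\x00\x00\x00\x00\x00\x00".b + original

def checksum(bytes)
  Digest::SHA256.hexdigest(bytes)
end

puts checksum(original)
puts checksum(tagged) # differs from the line above, though the audio bytes are unchanged
```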

eddierubeiz commented 5 years ago

I'm torn between 1 and 2. We've had incidents where staff have managed to mislabel files that are not TIFF files as .tif, so I'm reluctantly leaning towards 2. Recipe: Test the file with marcel. If the resulting type is unknown, then (and only then) resort to: file -b --mime-type original.mp3.
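That recipe could be sketched like this. The method names are hypothetical, and the primary detector is passed in as a callable so the sketch does not depend on any gem; in our app it would presumably wrap the Marcel call quoted earlier in this thread. The shell-out uses the same `file -b --mime-type` invocation as above.

```ruby
require "open3"

UNKNOWN_TYPE = "application/octet-stream"

# Ask the system `file` utility; returns nil if it is missing or fails.
def file_command_type(path)
  out, status = Open3.capture2("file", "-b", "--mime-type", path.to_s)
  status.success? ? out.strip : nil
rescue Errno::ENOENT
  nil
end

# Try the primary (ruby) detector first; shell out only when it gives up.
def detect_content_type(path, primary:, fallback: method(:file_command_type))
  type = primary.call(path)
  return type unless type.nil? || type == UNKNOWN_TYPE

  fallback.call(path) || UNKNOWN_TYPE
end
```

Usage would be something like `detect_content_type(path, primary: ->(p) { Marcel::MimeType.for(Pathname.new(p)) })`. Keeping the fallback injectable also makes it easy to test without depending on whichever version of file a given host has.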

jrochkind commented 5 years ago

@sanfordd writes in Slack that in the fenselau case, the mp3's were mistakenly used, our ingest process should have used the available WAV file to create a FLAC as the "original" in our app instead.

However, there are other cases where mp3 "originals" are and have been used. @eddierubeiz , can you spot check some of them to see if they can be correctly identified? Daniel says the Herman Mark OH's are some examples.

(As above, rather than go through the entire app, you can spot check in a console with: Marcel::MimeType.for Pathname.new(path))

If we can't find any actual desired originals that fail content type detection, maybe we don't need to worry about it for now.

sanfordd commented 5 years ago

We could also look at updating our local mimemagic list of files (https://github.com/minad/mimemagic#extra-magic-overlay) as well as calling out to file if it comes up as an issue.

jrochkind commented 5 years ago

It is not obvious to me how to figure out the right magic byte overlay to correctly identify these odd ADTS mp3s, but I guess we could figure it out somehow!

eddierubeiz commented 5 years ago

For the Herman Mark oral history original mp3s, they were correctly described as audio in all cases by New Thing.

jrochkind commented 5 years ago

I think maybe we're no longer concerned about this issue, as the "certain" MP3s were mistakes, and we may not run into any other such certain MP3s in real production.

It turned out not to be completely trivial to fix this.

The fix would probably be falling back to shelling out to the unix file command in cases where the ruby content-type detector can't come up with a type, since file managed to identify these files. One downside of that is that different versions of file on different OSs/versions may not come up with the same answers, making reproducing problems somewhat more confusing.

jrochkind commented 5 years ago

Won't work on this for now, since we're not sure we'll encounter it with files we mean to ingest.