Closed eddierubeiz closed 5 years ago
Hmm. You're saying our file characterization is improperly characterizing them, our app doesn't know they are mp3, so they don't get derivatives? That is annoying. (Is this happening to most/all of our audio, or just certain things?)
Let me sanity check...
I download: fenselau_c_0710_1-2.mp3
Our file characterization, under the hood, uses the marcel tool to determine content-type. Let me try that manually on that file.
Marcel::MimeType.for Pathname.new("/Users/jrochkind/Downloads/fenselau_c_0710_1-2.mp3")
Indeed returns application/octet-stream. doh!
Hmm, file says it's "MPEG ADTS" -- I was wondering if it wasn't really an mp3, but that is the same thing file says for things we believe are mp3. And marcel can correctly identify our ice_cubes.mp3 as mp3.
There is not a newer version of marcel than the one we have.
I think we should report this as a bug to marcel. But that requires us to give them a sample file somehow -- do you want to try creating a much shorter excerpt of this that still reproduces? I wonder if any monaural mp3 would be missed by marcel.
But for now, we are going to need to investigate other content-type sniffers I think -- OR configure it to trust the file ending mp3 if it can't figure it out otherwise. Hmm.
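A rough sketch of the "trust the file ending" idea, in case it helps the discussion. The helper name and the suffix table are made up for illustration and are not anything in our app; the sniffed type would come from whatever magic-byte detector we use. Marcel's README also describes passing name: to Marcel::MimeType.for so that the filename is consulted when the magic bytes are inconclusive, which may be simpler if it behaves that way in our version.

```ruby
# Illustrative suffix table; the entries are assumptions, not our real configuration.
SUFFIX_CONTENT_TYPES = {
  ".mp3"  => "audio/mpeg",
  ".flac" => "audio/flac"
}.freeze

GENERIC_TYPE = "application/octet-stream"

# sniffed_type is whatever the magic-byte detector returned for the file.
# Only when that is missing or generic do we look at the file suffix.
def content_type_with_suffix_fallback(sniffed_type, filename)
  return sniffed_type unless sniffed_type.nil? || sniffed_type == GENERIC_TYPE

  SUFFIX_CONTENT_TYPES.fetch(File.extname(filename).downcase, GENERIC_TYPE)
end
```

Note that a confident magic-byte result always wins here, so a mislabeled suffix can only matter when sniffing has already failed.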
Yeah -- I would actually guess it's either the missing ID3 version 2.4.0 thing, or the layer III, v1 vs. layer III, v2 thing. From what I know about id3 tags, they're not integral to mp3, so probably the latter. Avenues for further research.
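To check those two guesses by hand, here is a minimal sketch that looks at the first bytes of a file. It relies on two facts about the formats: an ID3v2 tag begins with the ASCII bytes "ID3", and an MPEG audio frame header begins with 11 set bits followed by a 2-bit version field and a 2-bit layer field. The helper name is hypothetical, and it does not skip past an ID3 tag to find the first frame, so for tagged files it only reports that a tag is present.

```ruby
MPEG_VERSIONS = { 0b00 => "MPEG 2.5", 0b10 => "MPEG 2", 0b11 => "MPEG 1" }.freeze
MPEG_LAYERS   = { 0b01 => "Layer III", 0b10 => "Layer II", 0b11 => "Layer I" }.freeze

# Hypothetical helper: describe what the leading bytes of an audio file look like.
def describe_leading_bytes(bytes)
  bytes = bytes.b # compare as raw bytes, not as text
  return { id3: true } if bytes.start_with?("ID3")

  b0 = bytes.getbyte(0)
  b1 = bytes.getbyte(1)
  # Frame sync: first byte 0xFF and the top three bits of the second byte set.
  if b0 == 0xFF && b1 && (b1 & 0xE0) == 0xE0
    { id3: false,
      version: MPEG_VERSIONS[(b1 >> 3) & 0b11],
      layer: MPEG_LAYERS[(b1 >> 1) & 0b11] }
  else
    { id3: false, version: nil, layer: nil }
  end
end
```

In a console this could be called as describe_leading_bytes(File.binread(path, 4)) on a failing file and on a working one to compare.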
mark_h_0030_1-1.mp3 is mono, so that's probably not the problem.
Technically ADTS is not really the same as mp3 though -- as far as I can tell. I wonder which one we really have! Media file types are so confusing!
Yeah. Some interesting discussion of this at https://unix.stackexchange.com/questions/177236/lossless-conversion-of-mpeg-adts-mp3-to-normal-mp3
An actual sample mp3 file I downloaded from the web gets identified by file as "Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 224 kbps, 32 kHz, Stereo".
(I don't know why the ice_cubes one you found also gets identified as ADTS).
One sanity check, with @sanfordd -- is that fenselau_c_0710_1-2.mp3 file an "original" from OH? It didn't go through any of our workflow for transcoding audio? If it didn't go through any of our workflow, it means it's not a bug in our workflow -- but still, I wonder whether it is really an MP3 or an ADTS. If the latter (and they are different?), should we tell OH that some of their mp3 files aren't really mp3s?
How would we figure out if it's really ADTS or MP3? Other than trusting file on MacOS? Or are those actually enough the same thing that nobody distinguishes?
Technically they all contain ADTS streams -- it's the "contains" part that's missing from the fenselau files. I wonder whether we could just add an id3 tag and this problem would go away.
We probably want to tell OH that their MP3 originals are actually ADTS, if that's what's going on.
Some of the answers on the StackExchange you linked to suggest something different than "Technically they all contain ADTS streams" to me though. The top answer says: "There is no way to convert MPEG ADTS to MP3 without decoding and reencoding them. They are fundamentally different formats/encodings."
Ruby MimeMagic correctly determines audio/mpeg. (It may be that ADTS and MP3 are different, but they are both audio/mpeg?).
We can easily change from using marcel to using mimemagic, and should probably do so then.
Alternately, we could configure it to use marcel, but fall back to mimemagic when marcel can't find anything. Not sure of the pros and cons. Probably just switching to mimemagic is fine.
id3tag -aartist fenselau_c_0710_1-3.mp3 fixes the problem. See https://kithe.sciencehistory.org/admin/asset_files/tfnxcyp .
Very good to know. But if that's an original from OH, we probably don't want to mess with it if we don't have to -- and it looks like we don't have to, there's another reasonable ruby magic-byte analyzer that figures it out successfully. (And falling back to trusting the mp3 suffix would be another option).
I am still very curious why marcel fails here and mimemagic succeeds -- marcel looks like it starts with mimemagic, and just adds some more types on top, it doesn't look like it should be possible for it to fail where mimemagic succeeds.
I might spend some more time debugging marcel, perhaps to file a ticket with marcel.
@jrochkind In our application an MP3 file was used as an original, no transcoding was done. I did a quick check of the original file with the file command and it's reading the mimetype as audio/mpeg.
Uh oh, investigated further, I was wrong, mimemagic can't correctly identify the file either.
Apparently mimemagic's magic byte database isn't as good as the one on MacOS and ubuntu, somehow.
1. We could resort to trusting the file suffix if we can't figure out the content type from byte analysis. (All our content is uploaded by staff, so nobody should be trying to maliciously spoof us)
2. We could resort to the command-line file utility run on the server, when the ruby analysis can't figure it out. But since different hosts will have different versions of file with different dbs, this means there could be edge cases where different determinations are made and it's hard to debug (on dev vs prod; when we upgrade the OS it changes; etc).
3. We could manipulate the files as you suggest to 'trick' the magic byte detector. This has potential archival considerations, since these are 'originals', I think.
Hmm.
The biggest consideration is that any id3 tag edits will change the checksum of the file if we do want to compare it to the copy Oral Histories has.
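For what it's worth, comparing our copy against the Oral Histories copy is a one-liner with Ruby's standard Digest library. This is just a sketch; the helper name is hypothetical and SHA-256 is an arbitrary choice of digest, not necessarily what either side records.

```ruby
require "digest"

# Hypothetical helper: true only when both files have the same SHA-256 digest,
# i.e. are byte-for-byte identical for practical purposes.
def same_checksum?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end
```

Any tag written into the file, however small, changes the bytes and therefore makes this return false against the untouched copy.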
I'm torn between 1 and 2. We've had incidents where staff have managed to mislabel files that are not TIFF files as .tif, so I'm reluctantly leaning towards 2.
Recipe:
Test the file with marcel. If the resulting type is unknown, then (and only then) resort to:
file -b --mime-type original.mp3
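A minimal sketch of that recipe, under some assumptions: the method names are invented for illustration, the primary detector is passed in as a callable (in our app it would presumably wrap Marcel::MimeType.for), and "unknown" is taken to mean nil or application/octet-stream. The shell-out uses Open3 from the standard library with the same file flags as above.

```ruby
require "open3"

UNKNOWN_TYPE = "application/octet-stream"

# Shell out to file(1). Returns nil if the command is missing or exits non-zero.
def mime_type_from_file_command(path)
  out, status = Open3.capture2("file", "-b", "--mime-type", path.to_s)
  status.success? ? out.strip : nil
rescue Errno::ENOENT
  nil
end

# primary:  callable returning a content type (or nil) for a path.
# fallback: only consulted when primary returns nil or the generic type.
def detect_content_type(path, primary:, fallback: method(:mime_type_from_file_command))
  type = primary.call(path)
  return type unless type.nil? || type == UNKNOWN_TYPE

  fallback.call(path) || UNKNOWN_TYPE
end
```

Usage would look something like detect_content_type(path, primary: ->(p) { Marcel::MimeType.for(Pathname.new(p)) }). Keeping the fallback injectable also makes it possible to test without depending on whichever version of file a given host has installed.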
@sanfordd writes in Slack that in the fenselau case, the mp3's were mistakenly used, our ingest process should have used the available WAV file to create a FLAC as the "original" in our app instead.
However, there are other cases where mp3 "originals" are and have been used. @eddierubeiz , can you spot check some of them to see if they can be correctly identified? Daniel says the Herman Mark OH's are some examples.
(As above, rather than go through the entire app, you can spot check in a console with: Marcel::MimeType.for Pathname.new(path) )
If we can't find any actual desired originals that fail content type detection, maybe we don't need to worry about it for now.
We could also look at updating our local mimemagic list of files (https://github.com/minad/mimemagic#extra-magic-overlay) as well as calling out to file if it comes up as an issue.
It is not obvious to me how to figure out the right magic byte overlay to correctly identify these odd ADTS mp3s, but I guess we could figure it out somehow!
For the Herman Mark oral history original mp3s, they were correctly described as audio in all cases by New Thing.
I think maybe we're no longer concerned about this issue, as the "certain" MP3s were mistakes, and we may not run into any other such certain MP3s in real production.
It turned out not to be completely trivial to fix this.
The fix would probably be falling back to shelling out to the unix file command in cases where the ruby content-type detector can't come up with a type, since file did manage to identify these files. One downside of that is different versions of file on different OSs/versions may not come up with the same answers, making reproducing problems somewhat more confusing.
Won't work on this for now, since we're not sure we'll encounter it with files we mean to ingest.
Take a look at the two following assets:
https://kithe.sciencehistory.org/admin/asset_files/rujbxiy https://kithe.sciencehistory.org/admin/asset_files/c676fil
Downloading them, then running the file command on them yields:
fenselau_c_0710_1-2.mp3: MPEG ADTS, layer III, v2, 128 kbps, 22.05 kHz, Monaural
fenselau_c_0710_1-3.mp3: MPEG ADTS, layer III, v2, 128 kbps, 22.05 kHz, Monaural
The resulting content type is application/octet-stream and audio derivatives are never generated. Compare e.g. https://kithe.sciencehistory.org/admin/asset_files/x633f206c
mark_h_0030_1-1.mp3: Audio file with ID3 version 2.4.0, contains:MPEG ADTS, layer III, v1, 64 kbps, 48 kHz, Monaural
which is correctly characterized as audio/mpeg.