rails / marcel

Find the mime type of files, examining file, filename and declared type
Apache License 2.0

ZIP archive misidentified as video/x-ms-wmv #77

Open mdavidn opened 2 years ago

mdavidn commented 2 years ago

I have a valid ZIP archive that happens to include the bytes wmv2 in the first four kilobytes. Active Storage misidentifies the file as Windows Media Video. When scanning over such a broad range of bytes, WMV magic needs a lower priority than other matches.

Marcel::MimeType.for Pathname.new('A-453.zip'), name: 'A-453.zip', declared_type: 'application/zip'
# => "video/x-ms-wmv"

File.read('A-453.zip')[0...4]
# => "PK\u0003\u0004"

File.read('A-453.zip').index('wmv2')
# => 585

`unzip -t A-453.zip`.chomp.split("\n").last
# => "No errors detected in compressed data of A-453.zip."

mdavidn commented 2 years ago

Here's my workaround for now, added to an initializer.

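# Guard: only drop the WMV definition while this Marcel version still
# misidentifies a ZIP header followed by "wmv2".
if Marcel::MimeType.for("PK\03\04wmv2") == 'video/x-ms-wmv'
  Marcel::Magic.remove('video/x-ms-wmv')
end

pixeltrix commented 2 years ago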

Just been bitten by this for a PDF as well. Looking at the definition here, it seems that any instance of the string wmv2 in the first 8KB will trigger this match:

https://github.com/rails/marcel/blob/8e285636063d3617df6f73bc73de6574d83a53d5/data/tika.xml#L7701-L7715

That seems wildly broad for a magic string, but I think the issue is that the Tika rule is designed to match a codec type, so it would only apply in the context of a file ending in .wmv, whereas Marcel applies it as a general magic string. There could be other examples of mismatches like this in the Tika source file 😬
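
To make the failure mode concrete, here is a minimal sketch of how a floating signature can shadow an anchored one. It is a toy model, not Marcel's or Tika's actual code: the `Rule` struct, the `RULES` table, its ordering, and both `detect_*` methods are hypothetical names invented for illustration. The 8 KB window comes from the discussion above, and the sample bytes only mimic the two properties reported for A-453.zip (ZIP header at byte 0, `wmv2` at byte 585).

```ruby
# Hypothetical, simplified signature table. This is NOT Marcel's data
# structure; the offsets and ordering only mirror the behaviour reported
# in this thread.
Rule = Struct.new(:type, :bytes, :offsets)

RULES = [
  Rule.new("video/x-ms-wmv", "wmv2".b, 0..8192),      # floating: anywhere in the first 8 KB
  Rule.new("application/zip", "PK\x03\x04".b, 0..0),  # anchored: only at byte 0
].freeze

# True if rule.bytes starts at some offset inside rule.offsets.
def rule_matches?(rule, data)
  window = data.byteslice(0, rule.offsets.last + rule.bytes.bytesize) || "".b
  pos = window.index(rule.bytes, rule.offsets.first)
  !pos.nil? && pos <= rule.offsets.last
end

# Naive strategy: the first rule in table order that matches wins.
def detect_first_match(data)
  data = data.b
  RULES.find { |rule| rule_matches?(rule, data) }&.type
end

# Alternative: try rules with the narrowest offset range first, so an
# anchored signature outranks a floating one.
def detect_anchored_first(data)
  data = data.b
  RULES.sort_by { |rule| rule.offsets.size }
       .find { |rule| rule_matches?(rule, data) }&.type
end

# Synthetic stand-in for A-453.zip: ZIP header, padding, "wmv2" at byte 585.
sample = "PK\x03\x04".b + ("\x00".b * 581) + "wmv2".b

puts detect_first_match(sample)     # video/x-ms-wmv
puts detect_anchored_first(sample)  # application/zip
```

Sorting by offset-range width is only one possible tie-break. Restricting the floating rule to files that already look like WMV (for example by extension or parent type) would be another, and may be closer to what the Tika rule intends.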