rails / marcel

Find the mime type of files, examining file, filename and declared type
Apache License 2.0

Various office files wrong mimetype #44

Closed: cars10 closed this issue 3 years ago

cars10 commented 3 years ago

According to this, the MIME type of .oft (Outlook email template) files should be application/vnd.ms-outlook, but marcel currently reports application/x-tika-msoffice.

Some other issues include:

All of these files worked on marcel 0.3.3

gmcgibbon commented 3 years ago

I can't get oft and accdb to match by name with 0.3.3; maybe they only work by content matching. I'll work on patching the others into the current types db in the meantime. If you could edit this test case to reproduce the type matching you describe in 0.3.3, that would be helpful. Thanks!

# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  # gem "marcel", github: "rails/marcel", branch: "main"
  gem "marcel", "0.3.3"
  gem "minitest"
end

require "minitest"
require "minitest/autorun"
require "marcel"

class BugTest < Minitest::Test
  # Expected types when matching by file name only on marcel 0.3.3.
  {
    oft: "application/octet-stream",
    accdb: "application/octet-stream",
    mdb: "application/vnd.ms-access",
    mht: "application/x-mimearchive"
  }.each do |ext, type|
    define_method("test_#{ext}") do
      assert_equal type, Marcel::MimeType.for(name: "file.#{ext}")
    end
  end
end

cars10 commented 3 years ago

office_files.zip

Hi, some errors on my side:

The following code works with the attached files:

# frozen_string_literal: true

require 'bundler/inline'

gemfile(true) do
  source 'https://rubygems.org'

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  # gem "marcel", github: "rails/marcel", branch: "main"
  gem 'marcel', '0.3.3'
  gem 'minitest'
end

require 'minitest'
require 'minitest/autorun'
require 'marcel'
require 'pathname' # Pathname is used below to match by file content

class BugTest < Minitest::Test
  {
    oft: 'application/x-ole-storage',
    accdb: 'application/octet-stream',
    mdb: 'application/vnd.ms-access',
    mde: 'application/vnd.ms-access',
    mht: 'message/rfc822'
  }.each do |ext, type|
    define_method("test_#{ext}") do
      assert_equal type, Marcel::MimeType.for(Pathname.new("files/#{ext}.#{ext}"))
    end
  end
end

gmcgibbon commented 3 years ago

It looks like the TTF identification bug is being caused by a manual mime extension: https://github.com/rails/marcel/blob/a525d5b38f287ca0511c8eb26e657a1d46686e5f/lib/marcel/mime_type/definitions.rb#L40

We need to either define magic for accdb/mdb files or adjust the magic matcher for TTFs. Preferably, both Access database types (accdb and mdb) should identify as application/vnd.ms-access.
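
A rough, self-contained sketch of what content-based ("magic") detection for Access files could look like. It assumes the commonly documented signatures "Standard Jet DB" (mdb) and "Standard ACE DB" (accdb) at byte offset 4; neither signature is stated in this thread. `access_database?` is a hypothetical helper written in plain Ruby, not part of marcel's API.

```ruby
# frozen_string_literal: true

require "stringio"

# ASSUMPTION: these signatures and the offset come from commonly documented
# Access file headers, not from this thread or from marcel's type database.
ACCESS_SIGNATURE_OFFSET = 4
ACCESS_SIGNATURE_LENGTH = 15
ACCESS_SIGNATURES = ["Standard Jet DB", "Standard ACE DB"].map(&:b).freeze

# Hypothetical helper: returns true when the bytes at the assumed offset
# match one of the assumed Access signatures.
def access_database?(io)
  io.rewind
  header = io.read(ACCESS_SIGNATURE_OFFSET + ACCESS_SIGNATURE_LENGTH).to_s.b
  signature = header.byteslice(ACCESS_SIGNATURE_OFFSET, ACCESS_SIGNATURE_LENGTH)
  ACCESS_SIGNATURES.include?(signature)
end
```

For example, `access_database?(StringIO.new("\x00\x01\x00\x00Standard ACE DB".b))` returns true under these assumptions, while an arbitrary byte string does not match.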

gmcgibbon commented 3 years ago

I can't find a good magic matcher for oft files. Microsoft's specs for file naming only mention Word, PowerPoint, and Excel.

If it isn't feasible to match a more specific type for Outlook files, we may want to fall back to application/x-ole-storage in cases where we would otherwise return the new Tika type (application/x-tika-msoffice).
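
The fallback idea could be sketched on the caller's side roughly as follows. This is only an illustration of the mapping, not how marcel implements or would implement it; `with_ole_fallback` and both constants are hypothetical names.

```ruby
# frozen_string_literal: true

# The two type strings are taken from this thread.
TIKA_MSOFFICE = "application/x-tika-msoffice"
OLE_STORAGE = "application/x-ole-storage"

# Hypothetical helper: swaps the generic Tika type for the OLE storage type
# and passes every other detected type through unchanged.
def with_ole_fallback(detected_type)
  detected_type == TIKA_MSOFFICE ? OLE_STORAGE : detected_type
end
```

A caller could wrap the result of `Marcel::MimeType.for(...)` with this helper so that oft files report application/x-ole-storage, as they did on 0.3.3.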