rails / marcel

Find the mime type of files, examining file, filename and declared type
Apache License 2.0

image/svg+xml returned for an html document with svg in it #67

Closed jbielick closed 2 years ago

jbielick commented 2 years ago

Summary

A properly declared HTML document that contains some svg is being detected as image/svg+xml. No hints given to Marcel seem to affect this result.

Reproduction:

Marcel::MimeType.for(
  StringIO.new('<!doctype html><html><body><svg></svg></body></html>'), 
  name: 'index.html', 
  declared_type: 'text/html',
)
=> "image/svg+xml"

What I expect to happen:

Returns text/html

What actually happens:

Returns image/svg+xml.

From what I could gather, the image/svg+xml matcher is at position 22 in Marcel's magic table (Marcel::MAGIC) and text/html is at position 361. SVG's magic range is large, so its pattern is nearly always found first, and Marcel then treats that magic match as proof of the type, regardless of the filename or declared type.
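
A rough sketch of why the ordering matters, assuming detection walks an ordered table and returns the first entry whose pattern is found. The table, the HTML_SAMPLE constant and the first_magic_match helper are made up for illustration and are not Marcel's internals:

```ruby
# Simplified stand-in for an ordered magic table (illustrative, not Marcel's data).
MAGIC_SKETCH = [
  ["image/svg+xml", [[0..4096, "<svg"]]],
  ["text/html",     [[0..64, "<!doctype html"]]],
].freeze

HTML_SAMPLE = '<!doctype html><html><body><svg></svg></body></html>'

# Returns the first type whose pattern starts at an offset inside its range,
# or nil when nothing matches. Earlier entries win over later ones.
def first_magic_match(content, table = MAGIC_SKETCH)
  table.each do |type, matchers|
    matchers.each do |offsets, pattern|
      # Window covering every start position in offsets.begin..offsets.end.
      length = (offsets.end - offsets.begin) + pattern.bytesize
      window = content[offsets.begin, length]
      return type if window && window.include?(pattern)
    end
  end
  nil
end

puts first_magic_match(HTML_SAMPLE)                       # SVG entry is tried first
puts first_magic_match(HTML_SAMPLE, MAGIC_SKETCH.reverse) # HTML entry is tried first
```

With the SVG entry first, the wide range is enough to find "<svg" deep inside the HTML document; with the order reversed, the doctype matches before SVG is ever considered.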

What's the ideal solution here?

vakuum commented 2 years ago

Isn't Marcel using the MIME type definitions of Tika?

Tika and file would recognize the correct MIME type:

Tika

$ echo '<!doctype html><html><body><svg></svg></body></html>' > test

$ wget https://dlcdn.apache.org/tika/1.28.1/tika-app-1.28.1.jar

$ java -jar tika-app-1.28.1.jar --detect test 
...
text/html

file

$ echo '<!doctype html><html><body><svg></svg></body></html>' > test

$ file --mime-type test
test: text/html

$ file --version
file-5.38
magic file from /etc/magic:/usr/share/misc/magic

vakuum commented 2 years ago

Workaround?

You could add the definitions for text/html before the definitions of image/svg+xml:

$ bundle exec rails c

> Marcel::MimeType.extend(
  'text/html',
  extensions: ['html', 'htm'],
  magic: [
    [0..64, "<!DOCTYPE HTML"],
    [0..64, "<!DOCTYPE html"],
    [0..64, "<!doctype HTML"],
    [0..64, "<!doctype html"],
    [0..64, "<HEAD"],
    [0..64, "<head"],
    [0..64, "<TITLE"],
    [0..64, "<title"],
    [0..64, "<HTML"],
    [0, "<BODY"],
    [0, "<body"],
    [0, "<DIV"],
    [0, "<div"],
    [0, "<TITLE"],
    [0, "<title"],
    [0, "<h1"],
    [0, "<H1"],
    [0..128, "<html"]
  ]
)

> Marcel::MimeType.extend(
  'text/html',
  extensions: ['html', 'htm'],
  magic: [
    [128..8192, "<html"]
  ]
)

> Marcel::MimeType.for(StringIO.new('<!doctype html><html><body><svg></svg></body></html>'))
 => "text/html"

Type definitions for text/html

> Marcel::TYPES.find_all { |magic| magic[0] == 'text/html' }
 => [["text/html", [["html", "htm"], [], "HyperText Markup Language"]]] 

> Marcel::MAGIC.each_index.find_all { |i| Marcel::MAGIC[i][0] == 'text/html' }
 => [361, 381]

> pp Marcel::MAGIC.find_all { |magic| magic[0] == 'text/html' }
[["text/html",
  [[0..64, "<!DOCTYPE HTML"],
   [0..64, "<!DOCTYPE html"],
   [0..64, "<!doctype HTML"],
   [0..64, "<!doctype html"],
   [0..64, "<HEAD"],
   [0..64, "<head"],
   [0..64, "<TITLE"],
   [0..64, "<title"],
   [0..64, "<HTML"],
   [0, "<BODY"],
   [0, "<body"],
   [0, "<DIV"],
   [0, "<div"],
   [0, "<TITLE"],
   [0, "<title"],
   [0, "<h1"],
   [0, "<H1"],
   [0..128, "<html"]]],
 ["text/html", [[128..8192, "<html"]]]]

Type definitions for image/svg+xml

> Marcel::TYPES.find_all { |magic| magic[0] == 'image/svg+xml' }
 => [["image/svg+xml", [["svg", "svgz"], ["application/xml"], "Scalable Vector Graphics"]]] 

> Marcel::MAGIC.each_index.find_all { |i| Marcel::MAGIC[i][0] == 'image/svg+xml' }
 => [22] 

> pp Marcel::MAGIC.find_all { |magic| magic[0] == 'image/svg+xml' }
[["image/svg+xml", [[0..4096, "<svg"]]]]
 => [["image/svg+xml", [[0..4096, "<svg"]]]]

genezys commented 2 years ago

Here is a workaround where I directly change the position of the SVG, HTML and XML magic configurations.

# Delete the 3 MIME types from Marcel, remembering their indices
svg_index = Marcel::MAGIC.find_index { |mime, _| mime == 'image/svg+xml' }
svg = Marcel::MAGIC.delete_at(svg_index)
xml_index = Marcel::MAGIC.find_index { |mime, _| mime == 'application/xml' }
xml = Marcel::MAGIC.delete_at(xml_index)
html_index = Marcel::MAGIC.find_index { |mime, _| mime == 'text/html' }
html = Marcel::MAGIC.delete_at(html_index)

# Insert them in the priority we want
new_index = [svg_index, xml_index, html_index].max
Marcel::MAGIC.insert(new_index, html, svg, xml)
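
Since text/html has two entries in the table (positions 361 and 381 in the output earlier in this thread) and find_index only returns the first, a variant that moves every entry for a type may be useful. This is only a sketch on a toy table: move_ahead is a hypothetical helper, and the image/gif and application/xml patterns are placeholders rather than Marcel's real definitions.

```ruby
# Hypothetical helper, not part of Marcel: returns a copy of `table` with every
# entry whose type is in `types` moved directly ahead of the first `before` entry.
# Entries keep their relative order; if `before` is absent they go to the end.
def move_ahead(table, types, before:)
  moved, rest = table.partition { |type, _| types.include?(type) }
  target = rest.find_index { |type, _| type == before } || rest.length
  rest.insert(target, *moved)
end

# Toy table in the same [type, matchers] shape as the output shown in this thread.
TOY_TABLE = [
  ["image/gif",       [[0, "GIF8"]]],          # placeholder pattern
  ["image/svg+xml",   [[0..4096, "<svg"]]],
  ["application/xml", [[0, "<?xml"]]],         # placeholder pattern
  ["text/html",       [[0..64, "<!doctype html"]]],
  ["text/html",       [[128..8192, "<html"]]],
].freeze

REORDERED = move_ahead(TOY_TABLE, ["text/html"], before: "image/svg+xml")
puts REORDERED.map(&:first).inspect
```

Applying this to the real table would presumably look like Marcel::MAGIC.replace(move_ahead(Marcel::MAGIC, ['text/html'], before: 'image/svg+xml')), which relies on the array being mutable, as the workaround above already does. I have not verified that against a specific Marcel version.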