Status: closed by jbielick 2 years ago
Isn't Marcel using the MIME type definitions of Tika?
Tika and file would recognize the correct MIME type:
$ echo '<!doctype html><html><body><svg></svg></body></html>' > test
$ wget https://dlcdn.apache.org/tika/1.28.1/tika-app-1.28.1.jar
$ java -jar tika-app-1.28.1.jar --detect test
...
text/html
$ echo '<!doctype html><html><body><svg></svg></body></html>' > test
$ file --mime-type test
test: text/html
$ file --version
file-5.38
magic file from /etc/magic:/usr/share/misc/magic
You could add the definitions for text/html before the definitions of image/svg+xml:
$ bundle exec rails c
> Marcel::MimeType.extend(
'text/html',
extensions: ['html', 'htm'],
magic: [
[0..64, "<!DOCTYPE HTML"],
[0..64, "<!DOCTYPE html"],
[0..64, "<!doctype HTML"],
[0..64, "<!doctype html"],
[0..64, "<HEAD"],
[0..64, "<head"],
[0..64, "<TITLE"],
[0..64, "<title"],
[0..64, "<HTML"],
[0, "<BODY"],
[0, "<body"],
[0, "<DIV"],
[0, "<div"],
[0, "<TITLE"],
[0, "<title"],
[0, "<h1"],
[0, "<H1"],
[0..128, "<html"]
]
)
> Marcel::MimeType.extend(
'text/html',
extensions: ['html', 'htm'],
magic: [
[128..8192, "<html"]
]
)
> Marcel::MimeType.for(StringIO.new('<!doctype html><html><body><svg></svg></body></html>'))
=> "text/html"
> Marcel::TYPES.find_all { |magic| magic[0] == 'text/html' }
=> [["text/html", [["html", "htm"], [], "HyperText Markup Language"]]]
> Marcel::MAGIC.each_index.find_all { |i| Marcel::MAGIC[i][0] == 'text/html' }
=> [361, 381]
> pp Marcel::MAGIC.find_all { |magic| magic[0] == 'text/html' }
[["text/html",
[[0..64, "<!DOCTYPE HTML"],
[0..64, "<!DOCTYPE html"],
[0..64, "<!doctype HTML"],
[0..64, "<!doctype html"],
[0..64, "<HEAD"],
[0..64, "<head"],
[0..64, "<TITLE"],
[0..64, "<title"],
[0..64, "<HTML"],
[0, "<BODY"],
[0, "<body"],
[0, "<DIV"],
[0, "<div"],
[0, "<TITLE"],
[0, "<title"],
[0, "<h1"],
[0, "<H1"],
[0..128, "<html"]]],
["text/html", [[128..8192, "<html"]]]]
> Marcel::TYPES.find_all { |magic| magic[0] == 'image/svg+xml' }
=> [["image/svg+xml", [["svg", "svgz"], ["application/xml"], "Scalable Vector Graphics"]]]
> Marcel::MAGIC.each_index.find_all { |i| Marcel::MAGIC[i][0] == 'image/svg+xml' }
=> [22]
> pp Marcel::MAGIC.find_all { |magic| magic[0] == 'image/svg+xml' }
[["image/svg+xml", [[0..4096, "<svg"]]]]
=> [["image/svg+xml", [[0..4096, "<svg"]]]]
Here is a workaround where I directly change the position of the SVG, HTML and XML magic configurations.
# Delete the 3 MIME types from Marcel, remembering their indices
svg_index = Marcel::MAGIC.find_index { |mime, _| mime == 'image/svg+xml' }
svg = Marcel::MAGIC.delete_at(svg_index)
xml_index = Marcel::MAGIC.find_index { |mime, _| mime == 'application/xml' }
xml = Marcel::MAGIC.delete_at(xml_index)
html_index = Marcel::MAGIC.find_index { |mime, _| mime == 'text/html' }
html = Marcel::MAGIC.delete_at(html_index)
# Insert them in the priority we want
new_index = [svg_index, xml_index, html_index].max
Marcel::MAGIC.insert(new_index, html, svg, xml)
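The session output earlier in this thread shows two text/html entries (indices 361 and 381) once extend has been called, and find_index only moves the first of them. Below is a minimal sketch of a variation that moves every matching entry. It assumes the table is a mutable array of [type, rules] pairs, as the pp output suggests. The helper name reprioritize! and the toy table contents are hypothetical and are not part of Marcel. It runs against a stand-in array so it can be tried without the gem.

```ruby
# Hypothetical helper (not part of Marcel): collect every entry whose type is
# listed in `ordered_types` and reinsert them together, in that order, so the
# block ends at the slot of the last such entry (like the `max` index above).
def reprioritize!(table, ordered_types)
  last_slot = table.rindex { |type, _| ordered_types.include?(type) }
  return table if last_slot.nil?

  moved, kept = table.partition { |type, _| ordered_types.include?(type) }
  # Order by requested priority; the original position breaks ties so that
  # multiple entries of one type keep their relative order.
  moved = moved.each_with_index
               .sort_by { |(type, _), i| [ordered_types.index(type), i] }
               .map(&:first)
  # Every moved entry except the last one sat before last_slot, so this is
  # the position of the block among the kept entries.
  slot = last_slot - moved.length + 1
  table.replace(kept.insert(slot, *moved))
end

# Toy stand-in for the magic table; the rules are illustrative only.
toy = [
  ["image/svg+xml",   [[0..4096, "<svg"]]],
  ["image/gif",       [[0, "GIF8"]]],
  ["application/xml", [[0, "<?xml"]]],
  ["text/html",       [[0..64, "<!doctype html"]]],
  ["text/html",       [[128..8192, "<html"]]]
]

reprioritize!(toy, ["text/html", "image/svg+xml", "application/xml"])
p toy.map(&:first)
# prints ["image/gif", "text/html", "text/html", "image/svg+xml", "application/xml"]
```

If this were applied to the real table, the call would presumably be reprioritize!(Marcel::MAGIC, [...]). That has not been verified here, and like the workaround above it relies on the constant being mutable.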
Summary
A properly declared HTML document that contains some SVG is being detected as image/svg+xml. No hints given to Marcel seem to affect this result.
Reproduction:
What I expect to happen:
Returns text/html.
What actually happens:
Returns image/svg+xml.
From what I could gather, the image/svg+xml matcher is at position 22 in the magic types and text/html is at position 361. SVG's magic range is large (0..4096), so it pretty much always wins, and the magic-byte match is taken as proof of the type.
What's the ideal solution here?