rails / marcel

Find the mime type of files, examining file, filename and declared type
Apache License 2.0

Summary of a number of differences in mime type reporting before and after Tika #48

Open malclocke opened 3 years ago

malclocke commented 3 years ago

Hello :wave:

In light of the differences that are showing up in MIME type reporting before and after the switch to Tika, I thought it might be nice to get ahead of the bug reports by running MIME type detection on a large set of example files before and after the change.

I found a source of about 500 files here: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/tree/master/tests/mime-detection

Unfortunately, as far as I can tell, Tika doesn't have a similar set of test files in its source.

I then ran the following test script against this set of files:

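require "marcel"

# Print "<basename> <detected MIME type>" for every file given on the command line.
ARGV.each do |filename|
  basename = File.basename(filename)

  # Pass both the IO and the file name so Marcel can use content and extension.
  File.open(filename) do |file|
    puts "%s %s" % [basename, Marcel::MimeType.for(file, name: basename)]
  end
end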

I ran this script against two versions of Marcel: v0.3.3 and the HEAD at the time of writing, a525d5b3.

The attached CSV shows all the instances where a different MIME type was reported between the two versions. There are a total of 286. I would say most of the MIME types are fairly niche and could probably be ignored without ever causing anyone a problem. But there are some common ones in there. Conversely, the set of files does not cover every MIME type known to humanity, so others will no doubt still show up.
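The comparison step itself isn't shown above. A minimal sketch of one way to do it, assuming each run's output was saved to a text file of `basename type` lines (the helper names `parse_report`, `diff_reports` and `to_csv` are made up for illustration and are not part of Marcel):

```ruby
require "csv"

# Parse "basename mime/type" lines into a Hash of basename => type.
# rpartition splits on the last space, so basenames containing spaces survive.
def parse_report(text)
  text.each_line(chomp: true).each_with_object({}) do |line, types|
    basename, _, type = line.rpartition(" ")
    types[basename] = type unless basename.empty?
  end
end

# Return [basename, old_type, new_type] rows for files whose type changed.
# Files missing from the new report are skipped.
def diff_reports(old_types, new_types)
  old_types.filter_map do |basename, old_type|
    new_type = new_types[basename]
    [basename, old_type, new_type] if new_type && new_type != old_type
  end
end

# Render the rows as CSV with a header naming the two versions.
def to_csv(rows, old_label, new_label)
  CSV.generate do |csv|
    csv << ["file", old_label, new_label]
    rows.each { |row| csv << row }
  end
end

# Hypothetical usage: ruby diff.rb old-report.txt new-report.txt
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  old_types, new_types = ARGV.map { |path| parse_report(File.read(path)) }
  puts to_csv(diff_reports(old_types, new_types), "v0.3.3", "a525d5b3")
end
```

Keeping the parsing and diffing separate from the file handling makes it easy to rerun the comparison against any later commit.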

Anyway, I figured this list may be useful. Feel free to close this issue if it's not. :smiling_face_with_three_hearts:

mimetype_for_diff-v0.3.3-a525d5b3.csv

gmcgibbon commented 3 years ago

I don't think we can legally use those files as our fixtures, but this is good to know nonetheless. Thanks for the info!

Here's a table version of the CSV detailing all the affected types (minus types I've already fixed in PRs). We can track fixes for these in this issue:

| PR open? | file | v0.3.3 type | a525d5b3 type |
|---|---|---|---|
| [ ] | 32x-rom.32x | application/x-genesis-32x-rom | application/octet-stream |
| [ ] | 3ds-tloz-mm.3ds | image/x-3ds | application/octet-stream |
| [ ] | 4jsno.669 | audio/x-mod | application/octet-stream |
| [ ] | adf-test.adf | application/x-amiga-disk-format | application/octet-stream |
| [ ] | aero_alt.cur | image/x-win-bitmap | application/octet-stream |
| [ ] | all_w.m3u8 | audio/x-mpegurl | application/vnd.apple.mpegurl |
| [ ] | Anaphraseus-1.21-beta.oxt | application/vnd.openofficeorg.extension | application/zip |
| [ ] | androide.k7 | application/x-thomson-cassette | audio/mpeg |
| [ ] | aportis.pdb | application/x-aportisdoc | chemical/x-pdb |
| [ ] | archive.lrz | application/x-lrzip | application/octet-stream |
| [ ] | ascii.stl | model/stl | application/vnd.ms-pki.stl |
| [ ] | atari-2600-test.A26 | application/x-atari-2600-rom | application/octet-stream |
| [ ] | atari-7800-test.A78 | application/x-atari-7800-rom | application/octet-stream |
| [ ] | atari-lynx-chips-challenge.lnx | application/x-atari-lynx-rom | application/octet-stream |
| [ ] | bathead.sk | image/x-skencil | application/octet-stream |
| [ ] | bibtex.bib | text/x-bibtex | text/x-matlab |
| [ ] | binary.stl | model/stl | application/vnd.ms-pki.stl |
| [ ] | blitz.m7 | application/x-thomson-cartridge-memo7 | application/octet-stream |
| [ ] | break.mtm | audio/x-mod | application/octet-stream |
| [ ] | bug106330.iso | application/x-cd-image | application/x-iso9660-image |
| [ ] | bug-30656-xchat.conf | application/octet-stream | text/x-config |
| [ ] | ccfilm.axv | video/ogg | application/ogg |
| [ ] | classiq1.hfe | application/x-hfe-floppy-image | application/octet-stream |
| [ ] | combined.karbon | application/x-karbon | application/zip |
| [ ] | comics.cb7 | application/x-cb7 | application/x-7z-compressed |
| [ ] | comics.cbt | application/x-cbt | application/x-gtar |
| [x] | ct_faac-adts.aac | audio/aac | audio/x-aac |
| [ ] | cyborg.med | audio/x-mod | application/octet-stream |
| [ ] | dbus-comment.service | text/x-dbus-service | application/octet-stream |
| [ ] | dbus.service | text/x-dbus-service | application/octet-stream |
| [ ] | debian-goodies_0.63_all.deb | application/vnd.debian.binary-package | application/x-debian-package |
| [ ] | dia.shape | application/x-dia-shape | image/svg+xml |
| [ ] | disk.img | application/x-raw-disk-image | application/octet-stream |
| [ ] | disk.raw-disk-image | application/x-raw-disk-image | application/octet-stream |
| [ ] | disk.vhd | text/x-vhdl | application/x-vhd |
| [ ] | dreamcast-us-samba-de-amigo-track-1.bin | application/x-sega-cd-rom | application/octet-stream |
| [ ] | Empty.chrt | application/x-kchart | application/zip |
| [ ] | en_US.zip.meta4 | application/metalink4+xml | application/xml |
| [ ] | esm.mjs | application/javascript | application/octet-stream |
| [ ] | eu_en_Sword_of_Vermilion.bin | application/x-sega-cd-rom | audio/mpeg |
| [ ] | example_42_all.snap | application/vnd.snap | application/octet-stream |
| [ ] | feed2 | application/rss+xml | application/xml |
| [ ] | feed.atom | application/atom+xml | application/xml |
| [ ] | feed.rss | application/rss+xml | application/xml |
| [ ] | feeds.opml | text/x-opml+xml | application/xml |
| [ ] | fuji.themepack | application/x-windows-themepack | application/vnd.ms-cab-compressed |
| [ ] | game-boy-color-test.gbc | application/x-gameboy-color-rom | application/octet-stream |
| [ ] | game-boy-test.gb | application/x-gameboy-color-rom | application/octet-stream |
| [ ] | game-gear-test.gg | application/x-gamegear-rom | application/octet-stream |
| [ ] | GammaChart.exr | image/x-exr | image/aces |
| [ ] | gedit.flatpakref | application/vnd.flatpak.ref | application/octet-stream |
| [ ] | genesis1.bin | application/x-genesis-rom | application/octet-stream |
| [ ] | genesis2.bin | application/x-genesis-rom | application/octet-stream |
| [ ] | gnome.flatpakrepo | application/vnd.flatpak.repo | application/octet-stream |
| [ ] | gtk-builder.ui | application/x-designer | application/xml |
| [ ] | hbo-playlist.qtl | application/x-quicktime-media-link | application/octet-stream |
| [ ] | hello.flatpak | application/vnd.flatpak | application/octet-stream |
| [ ] | helloworld.groovy | text/x-modelica | text/x-groovy |
| [ ] | helloworld.java | text/x-java | text/x-java-source |
| [ ] | helloworld.xpi | application/x-xpinstall | application/zip |
| [ ] | hello.xdgapp | application/vnd.flatpak | application/octet-stream |
| [ ] | hereyes_remake.mo3 | audio/x-mo3 | application/octet-stream |
| [ ] | image.sqsh | application/vnd.squashfs | application/octet-stream |
| [ ] | ISOcyr1.ent | application/xml-external-parsed-entity | text/plain |
| [ ] | iso-file.iso | application/x-cd-image | application/x-iso9660-image |
| [ ] | IWAD.WAD | application/x-doom-wad | application/x-doom |
| [ ] | javascript-without-extension | application/javascript | application/x-sh |
| [ ] | jc-win.ani | application/x-navi-animation | application/octet-stream |
| [ ] | json-ld-full-iri.jsonld | application/ld+json | application/octet-stream |
| [ ] | layersupdatesignals.flw | application/x-kivio | application/zip |
| [ ] | Leafpad-0.8.17-x86_64.AppImage | application/x-iso9660-appimage | application/x-elf |
| [ ] | Leafpad-0.8.18.1.glibc2.4-x86_64.AppImage | application/x-iso9660-appimage | application/x-elf |
| [ ] | linguist.ts | text/vnd.qt.linguist | application/xml |
| [ ] | live-streaming.m3u | audio/x-mpegurl | application/vnd.apple.mpegurl |
| [ ] | ls | application/x-sharedlib | application/x-elf |
| [ ] | m64p_test_rom.n64 | application/x-n64-rom | application/octet-stream |
| [ ] | m64p_test_rom.v64 | application/x-n64-rom | application/octet-stream |
| [ ] | m64p_test_rom.z64 | application/x-n64-rom | application/octet-stream |
| [ ] | markdown.md | text/markdown | text/x-web-markdown |
| [ ] | mega-drive-rom.gen | application/x-genesis-rom | application/octet-stream |
| [ ] | Metroid_japan.fds | application/x-fds-disk | application/octet-stream |
| [ ] | msg0001.gsm | audio/x-gsm | application/octet-stream |
| [ ] | msx2-metal-gear.msx | application/x-msx-rom | application/octet-stream |
| [ ] | msx-penguin-adventure.msx | application/x-msx-rom | application/octet-stream |
| [ ] | my-data.json-patch | application/json-patch+json | application/octet-stream |
| [ ] | mypaint.ora | image/openraster | application/zip |
| [ ] | neo-geo-pocket-color-test.ngc | application/x-neo-geo-pocket-color-rom | application/octet-stream |
| [ ] | neo-geo-pocket-test.ngp | application/x-neo-geo-pocket-rom | application/octet-stream |
| [ ] | nrl.trig | application/trig | application/octet-stream |
| [ ] | ooo.stw | application/vnd.sun.xml.writer.template | application/vnd.sun.xml.writer |
| [ ] | ooo-test.fodg | application/vnd.oasis.opendocument.graphics-flat-xml | application/xml |
| [ ] | ooo-test.fodp | application/vnd.oasis.opendocument.presentation-flat-xml | application/xml |
| [ ] | ooo-test.fods | application/vnd.oasis.opendocument.spreadsheet-flat-xml | application/xml |
| [ ] | ooo-test.fodt | application/vnd.oasis.opendocument.text-flat-xml | application/xml |
| [ ] | ooo.vor | application/vnd.stardivision.writer | application/x-staroffice-template |
| [ ] | Oriental_tattoo_by_daftpunk22.eps | image/x-eps | application/postscript |
| [ ] | panasonic_lumix_dmc_fz38_05.rw2 | image/x-panasonic-rw2 | image/x-raw-panasonic |
| [ ] | petite-ouverture-a-danser.ly | text/x-lilypond | application/octet-stream |
| [ ] | pico-rom.bin | application/x-sega-pico-rom | application/octet-stream |
| [ ] | playlist.asx | audio/x-ms-asx | application/x-ms-asx |
| [ ] | playlist.mrl | text/x-mrml | application/octet-stream |
| [ ] | playlist.wpl | application/vnd.ms-wpl | text/html |
| [ ] | plugins.qmltypes | text/x-qml | application/octet-stream |
| [ ] | pocket-word.psw | application/x-pocket-word | application/octet-stream |
| [ ] | Presentation.kpt | application/x-kpresenter | application/gzip |
| [ ] | project.glade | application/x-glade | application/xml |
| [ ] | PWAD.WAD | application/x-doom-wad | application/x-doom |
| [ ] | pyside.py | text/x-python3 | text/x-python |
| [ ] | raw-mjpeg.mjpeg | video/x-mjpeg | image/jpeg |
| [ ] | README-pandoc-flavored-markdown.md | text/markdown | text/x-matlab |
| [ ] | rectangle.qml | text/x-qml | application/octet-stream |
| [ ] | registry-nt.reg | text/x-ms-regedit | application/octet-stream |
| [ ] | registry.reg | text/x-ms-regedit | text/plain |
| [ ] | reStructuredText.rst | application/octet-stream | text/x-rst |
| [ ] | rgb-reference.ktx | image/ktx | application/octet-stream |
| [ ] | ringtone.ime | text/x-imelody | application/octet-stream |
| [ ] | ringtone.mmf | application/x-smaf | application/vnd.smaf |
| [ ] | ripoux.sap | application/x-thomson-sap-image | application/octet-stream |
| [ ] | sample1.nzb | application/x-nzb | application/xml |
| [ ] | sample.vsdx | application/vnd.ms-visio.drawing.main+xml | application/x-tika-ooxml |
| [ ] | saturn-test.bin | application/x-saturn-rom | application/octet-stream |
| [ ] | sega-cd-test.iso | application/x-sega-cd-rom | application/x-iso9660-image |
| [ ] | serafettin.rar | application/pdf | application/x-rar-compressed;version=4 |
| [ ] | settopbox.ts | video/mp2t | application/octet-stream |
| [ ] | sg1000-test.sg | application/x-sg1000-rom | application/octet-stream |
| [ ] | shebang.qml | text/x-qml | application/x-sh |
| [ ] | shell-calls-awk | application/x-perl | application/x-sh |
| [ ] | simon.669 | audio/x-mod | application/octet-stream |
| [ ] | sms-test.sms | application/x-sms-rom | application/octet-stream |
| [ ] | sqlite2.kexi | application/x-kexiproject-sqlite2 | application/octet-stream |
| [ ] | sqlite3.kexi | application/vnd.sqlite3 | application/x-sqlite3 |
| [ ] | stream.nsc | application/x-netshow-channel | application/octet-stream |
| [ ] | subtitle-microdvd.sub | text/x-microdvd | application/octet-stream |
| [ ] | subtitle-mpsub.sub | text/x-microdvd | application/octet-stream |
| [ ] | subtitle.srt | application/x-subrip | application/octet-stream |
| [ ] | subtitle.ssa | text/x-ssa | application/octet-stream |
| [ ] | subtitle-subviewer.sub | text/x-microdvd | application/octet-stream |
| [ ] | systemd.automount | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.device | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.mount | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.path | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.scope | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.service | text/x-dbus-service | application/octet-stream |
| [ ] | systemd.slice | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.socket | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.swap | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.target | text/x-systemd-unit | application/octet-stream |
| [ ] | systemd.timer | text/x-systemd-unit | application/octet-stream |
| [ ] | test10.gpx | application/gpx+xml | application/xml |
| [ ] | test3.py | text/x-python3 | text/x-python |
| [ ] | test.aa | audio/x-pn-audibleaudio | application/octet-stream |
| [ ] | test.aax | audio/x-pn-audibleaudio | video/quicktime |
| [ ] | test.alz | application/x-alz | application/octet-stream |
| [ ] | test.bflng | text/html | application/xml |
| [ ] | test.bsdiff | application/x-bsdiff | application/octet-stream |
| [ ] | testcases.ksp | application/x-kspread | application/gzip |
| [ ] | test.ccmx | application/x-ccmx | application/octet-stream |
| [ ] | test-cdda.toc | application/x-cdrdao-toc | application/octet-stream |
| [ ] | test-cdrom.toc | application/x-cdrdao-toc | application/octet-stream |
| [ ] | test.class | application/x-java | application/java-vm |
| [ ] | test.cl | text/x-opencl-src | text/x-common-lisp |
| [ ] | test.cmake | text/x-cmake | application/octet-stream |
| [ ] | test.coffee | application/vnd.coffeescript | text/x-coffeescript |
| [ ] | test.csvs | text/csv-schema | application/octet-stream |
| [ ] | test.dot | text/vnd.graphviz | application/msword |
| [ ] | test.d | text/x-dsrc | text/x-d |
| [ ] | test-en.mo | application/x-gettext-translation | application/octet-stream |
| [ ] | test-en.po | text/x-gettext-translation | application/octet-stream |
| [ ] | test.eps | image/x-eps | application/postscript |
| [ ] | test.feature | text/x-gherkin | application/octet-stream |
| [ ] | test.fit | image/fits | application/fits |
| [ ] | test.fl | application/x-fluid | application/octet-stream |
| [ ] | test.fli | video/x-flic | video/x-fli |
| [ ] | test.g3 | image/fax-g3 | image/g3fax |
| [ ] | test.gbr | image/x-gimp-gbr | application/octet-stream |
| [ ] | test.gcode | text/x.gcode | application/octet-stream |
| [ ] | test.geojson | application/geo+json | application/json |
| [ ] | test.geo.json | application/json | application/octet-stream |
| [ ] | test.gih | image/x-gimp-gih | application/octet-stream |
| [ ] | test.gnd | application/gnunet-directory | application/octet-stream |
| [ ] | test.gpx | application/gpx+xml | application/xml |
| [ ] | test.gs | text/x-genie | application/octet-stream |
| [ ] | test.html | text/html | application/xml |
| [ ] | test-html-with-svg.html | text/html | image/svg+xml |
| [ ] | test.ilbm | image/x-ilbm | audio/x-aiff |
| [ ] | test.im1 | image/x-sun-raster | application/octet-stream |
| [ ] | test.iptables | text/x-iptables | application/octet-stream |
| [ ] | test.ipynb | application/x-ipynb+json | application/octet-stream |
| [ ] | test_issue127.py | text/x-python3 | text/x-python |
| [ ] | test.it87 | application/x-it87 | application/octet-stream |
| [ ] | test.jar | application/x-java-archive | application/java-archive |
| [ ] | test.jceks | application/x-java-jce-keystore | application/octet-stream |
| [ ] | test.jks | application/x-java-keystore | application/octet-stream |
| [ ] | test.jnlp | application/x-java-jnlp-file | application/xml |
| [ ] | test.kdc | image/x-kodak-kdc | image/tiff |
| [ ] | test-kounavail2.kwd | application/x-kword | application/gzip |
| [ ] | test.lzo | application/x-lzop | application/octet-stream |
| [ ] | test.manifest | text/cache-manifest | text/plain |
| [ ] | test.metalink | application/metalink+xml | application/xml |
| [ ] | test.mml | application/mathml+xml | application/octet-stream |
| [ ] | test.mobi | application/x-mobipocket-ebook | application/octet-stream |
| [ ] | test.mof | text/x-mof | application/x-mobipocket-ebook |
| [ ] | test.mo | text/x-modelica | text/plain |
| [ ] | test.mpc | audio/x-musepack | application/vnd.mophun.certificate |
| [ ] | test.msi | application/x-msi | application/x-ms-installer |
| [ ] | test.ogg | audio/ogg | audio/vorbis |
| [ ] | test.ooc | text/x-ooc | application/octet-stream |
| [ ] | test.opus | audio/ogg | audio/opus |
| [ ] | test.owx | application/owl+xml | application/xml |
| [ ] | test.oxps | application/oxps | application/zip |
| [ ] | test.p12 | application/pkcs12 | application/x-pkcs12 |
| [ ] | test.pat | image/x-gimp-pat | application/octet-stream |
| [ ] | test.pgn | application/vnd.chess-pgn | application/x-chess-pgn |
| [ ] | test.php | application/x-php | text/html |
| [ ] | test.pl | application/x-perl | text/x-perl |
| [ ] | test.pm | application/x-perl | application/x-tika-msoffice |
| [ ] | test.pmd | application/x-pagemaker | text/x-perl |
| [ ] | test.por | application/x-spss-por | application/octet-stream |
| [ ] | test.pot | text/x-gettext-translation-template | application/vnd.ms-powerpoint |
| [ ] | test.py3 | text/x-python3 | application/x-sh |
| [ ] | test.py | text/x-python3 | text/x-python |
| [ ] | test.pyx | text/x-python | application/octet-stream |
| [ ] | test.qp | application/x-qpress | application/octet-stream |
| [ ] | test.qti | application/x-qtiplot | application/octet-stream |
| [ ] | test.raml | application/raml+yaml | application/octet-stream |
| [ ] | test-reordered.ipynb | application/x-ipynb+json | application/octet-stream |
| [ ] | test.rs | text/rust | application/rls-services+xml |
| [x] | test.sass | text/x-sass | application/octet-stream |
| [ ] | test.sav | application/x-spss-sav | application/octet-stream |
| [x] | test.scss | text/x-scss | application/octet-stream |
| [ ] | test-secret.key | application/pgp-keys | application/vnd.apple.keynote |
| [ ] | test-secret-key.skr | application/pgp-keys | application/octet-stream |
| [ ] | test.sgi | image/x-sgi | image/x-rgb |
| [ ] | test.sqlite2 | application/x-sqlite2 | application/octet-stream |
| [ ] | test.sqlite3 | application/vnd.sqlite3 | application/x-sqlite3 |
| [ ] | test.ss | text/x-scheme | text/plain |
| [ ] | test.svh | text/x-svhdr | application/octet-stream |
| [ ] | test.sv | text/x-svsrc | application/octet-stream |
| [ ] | test.t | application/x-perl | application/x-lz4 |
| [ ] | test.tar.lz4 | application/x-lz4 | application/x-lzip |
| [ ] | test.tar.lz | application/x-lzip | application/zstd |
| [ ] | test.tar.zst | application/octet-stream | application/msword |
| [ ] | test-template.dot | application/msword-template | application/x-tex |
| [ ] | test.tex | text/x-tex | image/x-tga |
| [ ] | test.tga | image/x-tga | image/tiff |
| [ ] | test.tif | image/tiff | application/octet-stream |
| [ ] | test.ts | video/mp2t | text/troff |
| [ ] | test.ttl | text/turtle | application/octet-stream |
| [ ] | test.ttx | application/x-font-ttx | application/xml |
| [ ] | test.twig | text/x-twig | application/octet-stream |
| [ ] | test.url | application/x-mswinurl | application/octet-stream |
| [ ] | test.uue | text/x-uuencode | application/octet-stream |
| [ ] | test.vala | text/x-vala | application/octet-stream |
| [ ] | test-vpn.pcf | application/x-cisco-vpn-settings | application/x-font-pcf |
| [ ] | test.wim | application/x-ms-wim | application/octet-stream |
| [ ] | test.xar | application/x-xar | application/vnd.xara |
| [ ] | test.xht | application/xhtml+xml | application/xml |
| [ ] | test.xhtml | application/xhtml+xml | application/xml |
| [ ] | test.xlr | application/vnd.ms-works | application/x-tika-msworks-spreadsheet |
| [ ] | test.xml.in | application/xml | text/plain |
| [ ] | test.xpm | image/x-xpixmap | image/x-xbitmap |
| [ ] | test.xps | application/oxps | application/zip |
| [ ] | test.xsl | application/xslt+xml | application/xml |
| [ ] | test.yaml | application/x-yaml | text/x-yaml |
| [ ] | test.zst | application/octet-stream | application/zstd |
| [ ] | text.qmlproject | text/x-qml | application/octet-stream |
| [ ] | text.wwf | application/x-wwf | application/pdf |
| [ ] | TS010082249.pub | application/vnd.ms-publisher | application/x-mspublisher |
| [ ] | Utils.jsm | application/javascript | text/plain |
| [ ] | virtual-boy-wario-land.vb | application/x-virtual-boy-rom | text/x-vbdotnet |
| [ ] | webfinger.jrd | application/jrd+json | application/octet-stream |
| [ ] | white_640x480.kra | application/x-krita | application/zip |
| [ ] | wii.wad | application/x-wii-wad | application/x-doom |
| [ ] | wonderswan-color-chocobo.wsc | application/x-wonderswan-color-rom | application/octet-stream |
| [ ] | wonderswan-rockman-forte.ws | application/x-wonderswan-rom | application/octet-stream |
| [ ] | x_speex_ogg.spx | video/ogg | audio/speex |
| [ ] | zeb.3ds | image/x-3ds | application/octet-stream |

pixeltrix commented 3 years ago

I can see that getting legitimate test files for some of these could be fraught with issues - Genesis ROMs for example 😬

pixeltrix commented 3 years ago

Also, not all of the differences are bugs. For example, serafettin.rar returns application/pdf in 0.3.3, which is obviously wrong, so we shouldn't change that back (though there's still a question of whether the new value is correct).
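One way to sanity-check a case like this independently of either Marcel version is to look at the file's leading bytes. A rough sketch, assuming the well-known RAR signatures (`Rar!\x1A\x07\x00` for the 1.5 to 4.x format, `Rar!\x1A\x07\x01\x00` for RAR 5); the helper name `rar_version` is made up, this is not how Marcel or Tika performs detection, and the mapping to Tika's `version=` parameter is an assumption:

```ruby
# Known RAR archive signatures, as binary strings.
RAR4_MAGIC = "Rar!\x1A\x07\x00".b      # RAR 1.5 through 4.x
RAR5_MAGIC = "Rar!\x1A\x07\x01\x00".b  # RAR 5.0 and later

# Return 5 or 4 if the header starts with a RAR signature, otherwise nil.
# Self-extracting archives carry the signature later in the file and are
# deliberately not handled by this sketch.
def rar_version(header)
  header = header.to_s.b
  return 5 if header.start_with?(RAR5_MAGIC)
  return 4 if header.start_with?(RAR4_MAGIC)
  nil
end

# Hypothetical usage: ruby rar_check.rb serafettin.rar
if __FILE__ == $PROGRAM_NAME && ARGV.size == 1
  version = File.open(ARGV.first, "rb") { |file| rar_version(file.read(8)) }
  puts version ? "RAR format generation #{version}" : "no RAR signature found"
end
```

If the fixture starts with the older signature, that would at least be consistent with a RAR type rather than application/pdf.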

malclocke commented 3 years ago

@pixeltrix I agree that these are not all bugs. When a PR is submitted to fix a 'regression', it might be nice to ask for some kind of reference (e.g. an RFC or similar) showing why the old value is more correct than the new one.