orbeon / orbeon-forms

Orbeon Forms is an open source web forms solution. It includes an XForms engine, the Form Builder web-based form editor, and the Form Runner runtime.

http://www.orbeon.com/

GNU Lesser General Public License v2.1

Upload: assign mediatype based on content sniffing #1838

Open ebruchez opened 10 years ago

ebruchez commented 10 years ago

It's never reliable to depend on the client for the media type.
See also this article.

+1 from customer

avernet commented 10 years ago

For sniffing the media type on the server, Apache Tika seems to be a good option.

+1 from customer

ebruchez commented 10 years ago

This would be really useful, but as of 2014-07 neither customer has it high on their priority list.

ebruchez commented 4 years ago

+1 from customer

ebruchez commented 4 years ago

+1 from customer

ebruchez commented 4 years ago

One issue with Apache Tika is that it will pull in many dependencies. See the pom.xml.

ebruchez commented 4 years ago

The main worry with adding many dependencies is security vulnerabilities. So we should attempt to control which exact formats we support and only take in a few external dependencies. For example, we probably don't need to detect Ogg Vorbis formats out of the box. But we need:

PDF
common image/video/audio formats
Excel/Word (maybe we can just extract the relevant code from POI)
plain text formats
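
To make the "few formats, few dependencies" idea concrete, here is a minimal sketch of signature-based sniffing using only the JDK. The class name `MagicSniffer` and its `sniff` method are hypothetical, not existing Orbeon Forms or Tika code, and the `application/x-ole-storage` label for OLE2 files is a placeholder choice. The byte signatures themselves are the well-known ones for each format.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of Orbeon Forms or Apache Tika.
final class MagicSniffer {

    private MagicSniffer() {}

    // True if `data` contains the unsigned byte values `magic` starting at `offset`.
    private static boolean hasBytesAt(byte[] data, int offset, int... magic) {
        if (data.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[offset + i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    // Returns a media type for a handful of known signatures, or empty if none matches.
    // `head` only needs to hold the first few bytes of the upload.
    static Optional<String> sniff(byte[] head) {
        if (hasBytesAt(head, 0, 0x25, 0x50, 0x44, 0x46, 0x2D))                    // "%PDF-"
            return Optional.of("application/pdf");
        if (hasBytesAt(head, 0, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A))  // PNG
            return Optional.of("image/png");
        if (hasBytesAt(head, 0, 0xFF, 0xD8, 0xFF))                                // JPEG
            return Optional.of("image/jpeg");
        if (hasBytesAt(head, 0, 0x47, 0x49, 0x46, 0x38))                          // "GIF8"
            return Optional.of("image/gif");
        if (hasBytesAt(head, 0, 0x50, 0x4B, 0x03, 0x04))                          // ZIP local file header
            return Optional.of("application/zip");
        if (hasBytesAt(head, 0, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))  // OLE2 compound file
            return Optional.of("application/x-ole-storage");                     // placeholder label
        return Optional.empty();
    }
}
```

This sketch stops at the container level. A .docx or .xlsx file shows up as a ZIP, and a legacy .doc, .xls, or .msg file shows up as an OLE2 compound file, so telling Word from Excel would still require looking inside the container, which is where code extracted from POI might come in. Plain text detection is also left out, since it needs heuristics rather than a fixed signature.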

ebruchez commented 4 years ago

A quick list of dependencies that Apache Tika would add:

org.gagravarr.vorbis-java-tika
org.tallison.jmatio
org.apache.james.apache-mime4j-core
org.apache.james.apache-mime4j-dom
org.apache.commons.commons-compress
org.tukaani.xz
com.epam.parso
org.brotli.dec
com.github.luben.zstd-jni
commons-codec.commons-codec
org.apache.pdfbox.pdfbox
org.apache.pdfbox.pdfbox-tools
org.apache.pdfbox.preflight
org.apache.pdfbox.jempbox
org.apache.poi.poi
org.apache.poi.poi-scratchpad
org.apache.poi.poi-ooxml
com.healthmarketscience.jackcess.jackcess
com.healthmarketscience.jackcess.jackcess-encrypt
org.ow2.asm.asm
com.googlecode.mp4parser.isoparser
de.l3s.boilerpipe.boilerpipe
com.rometools.rome
org.gagravarr.vorbis-java-core
com.googlecode.juniversalchardet.juniversalchardet
org.codelibs.jhighlight
com.pff.java-libpst
com.github.junrar.junrar
org.apache.cxf.cxf-rt-rs-client
org.apache.commons.commons-exec
org.xerial.sqlite-jdbc
org.apache.opennlp.opennlp-tools
commons-io.commons-io
com.googlecode.json-simple.json-simple
com.github.openjson.openjson
com.google.code.gson.gson
org.slf4j.jul-to-slf4j
org.slf4j.jcl-over-slf4j
edu.ucar.netcdf4
org.jdom.jdom2
com.google.guava.guava
edu.ucar.grib
com.beust.jcommander
net.java.dev.jna.jna
org.jsoup.jsoup
com.google.protobuf.protobuf-java
edu.ucar.cdm
org.quartz-scheduler.quartz
com.mchange.c3p0
edu.ucar.httpservices
org.apache.commons.commons-csv
org.apache.sis.core.sis-utility
org.apache.sis.storage.sis-netcdf
org.apache.sis.core.sis-metadata
org.opengis.geoapi
edu.usc.ir.sentiment-analysis-parser
org.apache.ctakes.ctakes-core
org.apache.uima.uimafit-core
org.apache.uima.uimaj-core
org.apache.pdfbox.jbig2-imageio
com.github.jai-imageio.jai-imageio-jpeg2000

We already depend on:

org.bouncycastle.bcmail-jdk15on
org.bouncycastle.bcprov-jdk15on
org.ccil.cowan.tagsoup.tagsoup
org.tallison.metadata-extractor
org.slf4j.slf4j-api
junit.junit
org.mockito.mockito-core
org.slf4j.slf4j-log4j12
org.apache.httpcomponents.httpclient
org.apache.httpcomponents.httpmime
com.fasterxml.jackson.core.jackson-core
com.fasterxml.jackson.core.jackson-databind
com.fasterxml.jackson.core.jackson-annotations
com.github.jai-imageio.jai-imageio-core

ebruchez commented 4 years ago

+1 from customer

ebruchez commented 4 years ago

I verified that even with a plain upload field outside of Orbeon Forms, the File object's type is blank for a file with, for example, a .msg extension, while it is present for things like images.

Besides content sniffing, we could also use the filename extension to guess a content type if the browser doesn't send one, or as a replacement for the type sent by the browser.
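
The extension fallback could look roughly like the following sketch. The class `MediaTypeFallback`, its `resolve` method, and the tiny extension table are all hypothetical and only illustrate the "use the extension when the browser sends nothing" variant. A real table would need to be much larger and probably configurable.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; the mapping is a small sample, not a complete registry.
final class MediaTypeFallback {

    private MediaTypeFallback() {}

    private static final Map<String, String> BY_EXTENSION = Map.of(
        "pdf",  "application/pdf",
        "png",  "image/png",
        "jpg",  "image/jpeg",
        "jpeg", "image/jpeg",
        "txt",  "text/plain",
        "msg",  "application/vnd.ms-outlook"
    );

    static final String DEFAULT = "application/octet-stream";

    // Keeps the browser-provided type when there is one; otherwise guesses from the extension.
    static String resolve(String clientType, String filename) {
        if (clientType != null && !clientType.isBlank()) return clientType.trim();
        int dot = (filename == null) ? -1 : filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) return DEFAULT; // no extension at all
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, DEFAULT);
    }
}
```

With this sketch, `resolve("", "mail.msg")` yields `application/vnd.ms-outlook`, which covers the blank-type case described above. To use the extension as a replacement for the browser's type instead, the first check would be dropped. Either way, the extension is still client-controlled input, so it is no more trustworthy than the type sent by the browser.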