Open kngenie opened 10 years ago
Are you basing this on an existing MIME type detection engine or standard? e.g. https://mimesniff.spec.whatwg.org/ or an implementation like Apache Tika etc.
Nope, the current implementation is a collection of heuristics found through our own study. I looked at several existing implementations and found they were too heavyweight for Wayback (e.g. they try to identify the specific version of a format, extract additional metadata, etc.). I wasn't aware of the WHATWG recommendation on MIME type sniffing. That sounds useful. Thank you (again, a full implementation of this MIME type sniffing is not necessary for Wayback).
FWIW, Apache Tika isn't too bad as long as only tika-core is on the classpath and you use the .detect() method call. It's pretty much just a 'magic number' engine that returns a MIME type. However, if tika-parsers is on the classpath it goes deeper (parsing some formats rather than just looking at the header) and so becomes less performant and less predictable (e.g. it can hang on poorly-formed files).
However, Tika is probably still overkill for this use case.
Thank you for the info on Tika. I haven't looked into it deeply enough to know about that behavior.
Anyway, this patch introduces a new interface, MimeTypeDetector, that should allow others to try different implementations.
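To illustrate what "trying different implementations" could look like, here is a minimal sketch of a pluggable detector. The real MimeTypeDetector signature is not shown in this thread, so the `detect(byte[])` method, `ChainedDetector`, and `PdfDetector` below are hypothetical names chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical shape only: the actual MimeTypeDetector signature in the patch
// is not shown in this thread.
interface MimeTypeDetector {
    // Returns a MIME type guessed from the leading payload bytes, or null if undecided.
    String detect(byte[] head);
}

// Tries each delegate in order and returns the first non-null answer.
final class ChainedDetector implements MimeTypeDetector {
    private final List<MimeTypeDetector> delegates;

    ChainedDetector(MimeTypeDetector... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public String detect(byte[] head) {
        for (MimeTypeDetector d : delegates) {
            String type = d.detect(head);
            if (type != null) {
                return type;
            }
        }
        return null;
    }
}

// Toy implementation, present only to show how an alternative could be plugged in.
final class PdfDetector implements MimeTypeDetector {
    @Override
    public String detect(byte[] head) {
        String prefix = new String(head, 0, Math.min(head.length, 5), StandardCharsets.ISO_8859_1);
        return prefix.equals("%PDF-") ? "application/pdf" : null;
    }
}
```

With this shape, a Tika-backed or WHATWG-style detector would just be another implementation passed to `ChainedDetector`, and callers would not need to change.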
I read through the WHATWG recommendation quickly and got the impression that SimpleMimeTypeDetector implements a good subset of it. Something missing from the recommendation is JavaScript / CSS detection "without context", which is critical for Wayback (in fact, binary format detection in SimpleMimeTypeDetector is there just to avoid unnecessary charset conversion).
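As a rough sketch of the binary check mentioned above: compare the first few payload bytes against well-known magic numbers, and only fall through to charset conversion and text heuristics when nothing matches. `BinarySniffer` and its table are assumptions for illustration, not the actual contents of SimpleMimeTypeDetector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not the actual SimpleMimeTypeDetector code.
final class BinarySniffer {
    // Well-known leading byte signatures, checked in insertion order.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("image/gif", new byte[] {'G', 'I', 'F', '8'});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("image/jpeg", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F', '-'});
    }

    // Returns a binary MIME type, or null when the payload should be treated
    // as text (decoded, then run through the JavaScript / CSS heuristics).
    static String sniff(byte[] head) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Returning null for "not binary" keeps the expensive step (decoding the payload with a guessed charset) off the path for images and other binary captures.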
Found a backtracking-explosion problem with the CSS detection regular expression (fixed in b3b6f5a). I consider this changeset still open.
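For readers unfamiliar with this class of bug, here is a sketch of one common way to avoid it in Java. The actual expression and the fix in b3b6f5a are not shown in this thread, so the pattern below is purely hypothetical. A nested quantifier such as `([\w.#-]+\s*)+\{` can backtrack exponentially on a long input that never contains `{`; possessive quantifiers (`++`, `*+`) never give back what they matched, and capping the sniffed window bounds the remaining work.

```java
import java.util.regex.Pattern;

// Hypothetical CSS heuristic: not the expression used in the patch.
final class CssSniffer {
    // Only inspect a bounded prefix of the payload (assumed limit).
    private static final int MAX_SNIFF_CHARS = 4096;

    // selector tokens, then "{", then a property name followed by ":".
    // Possessive quantifiers prevent the regex engine from retrying
    // alternative splits of the selector run.
    private static final Pattern CSS_RULE =
        Pattern.compile("(?:[\\w.#-]++\\s*+)++\\{\\s*+[\\w-]++\\s*+:");

    static boolean looksLikeCss(String text) {
        String window = text.length() > MAX_SNIFF_CHARS
            ? text.substring(0, MAX_SNIFF_CHARS)
            : text;
        return CSS_RULE.matcher(window).find();
    }
}
```

This is a heuristic and will misclassify some inputs; the point is only that a non-matching payload fails quickly instead of hanging the replay.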
(This is an issue item for already completed work.) Determine the MIME type by looking into the payload when the mimetype in the search result is either suspected to have an incorrect value (e.g. text/html) or missing (e.g. unk). Known internally as ARI-3822, ARI-3888, WWM-58. Bug fixes in ARI-4071 and ARI-4078.
Base work is done in commits 65dfc40 through 7d9d332; bug fixes are being tracked on the mimetype-detector branch.
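The trigger condition in the description (sniff the payload only when the indexed mimetype is suspect or missing) could be sketched roughly as follows. `SniffPolicy` and the exact sets are assumptions; the description only names text/html and unk as examples.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical policy helper: names and sets are illustrative assumptions.
final class SniffPolicy {
    // Values treated as "no usable type recorded in the index".
    private static final Set<String> MISSING = new HashSet<>(Arrays.asList("", "unk"));
    // Values that are recorded but often wrong, so worth double-checking.
    private static final Set<String> SUSPECT = new HashSet<>(Arrays.asList("text/html"));

    // True when the payload should be inspected instead of trusting the index.
    static boolean shouldSniff(String indexedMimeType) {
        if (indexedMimeType == null) {
            return true;
        }
        String normalized = indexedMimeType.trim().toLowerCase(Locale.ROOT);
        return MISSING.contains(normalized) || SUSPECT.contains(normalized);
    }
}
```

Types outside both sets are trusted as-is, so the extra payload read only happens for the captures most likely to be mislabeled.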