internetarchive / wayback

IA's public Wayback Machine (moved from SourceForge)

Add mime type detection for replaying captures with incorrect content-type #46

Open kngenie opened 10 years ago

kngenie commented 10 years ago

(This issue records work that is already complete.) Determine the MIME type by inspecting the payload when the mimetype in the search result is either suspected to be incorrect (e.g. text/html) or missing (e.g. unk).
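For illustration, the "suspect or missing" check described above could be sketched as follows. This is a minimal sketch and not the code from the commits: the class name `SuspectMimeTypes`, the method `needsSniffing`, and the exact set of suspect values are assumptions. Only text/html and unk are taken from the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not Wayback's actual code: decides whether the
// mimetype recorded in the index can be trusted or should be re-detected
// from the payload.
final class SuspectMimeTypes {
    // Assumed policy: "text/html" and "unk" are the examples named in the
    // issue; a real deployment may treat more values as unreliable.
    private static final Set<String> SUSPECT =
            new HashSet<>(Arrays.asList("text/html", "unk"));

    private SuspectMimeTypes() {}

    static boolean needsSniffing(String indexedMimeType) {
        if (indexedMimeType == null) {
            return true; // nothing recorded at all
        }
        String base = indexedMimeType.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = base.indexOf(';');
        if (semi >= 0) {
            base = base.substring(0, semi).trim();
        }
        return base.isEmpty() || SUSPECT.contains(base);
    }
}
```

With this policy a capture indexed as image/png is replayed as-is, while one indexed as text/html or unk is passed to payload detection.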

Known internally as ARI-3822, ARI-3888, WWM-58. Bug fixes in ARI-4071 and ARI-4078.

The base work was done in commits 65dfc40 through 7d9d332; bug fixes are being tracked on the mimetype-detector branch.

anjackson commented 10 years ago

Are you basing this on an existing MIME type detection engine or standard? e.g. https://mimesniff.spec.whatwg.org/ or an implementation like Apache Tika etc.

kngenie commented 10 years ago

Nope, the current implementation is a collection of heuristics found through our own study. I looked at several existing implementations and found them too heavyweight for Wayback (e.g. they try to identify the specific version of a format, extract additional metadata, etc.). I wasn't aware of the WHATWG recommendation on MIME type sniffing. That sounds useful, thank you (again, a full implementation of this MIME type sniffing is not necessary for Wayback).

anjackson commented 10 years ago

FWIW, Apache Tika isn't too bad as long as only tika-core is on the classpath and you use the .detect() method call. In that setup it's pretty much just a 'magic number' engine that returns a MIME type. However, if tika-parsers is on the classpath it goes deeper (parsing some formats rather than just looking at the header) and so becomes less performant and less predictable (e.g. it can hang on poorly-formed files).

However, Tika is probably still overkill for this use case.
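To make the 'magic number' idea concrete, here is a minimal plain-Java sketch of header-based detection. It is neither Tika nor Wayback code: the class `MagicNumberSniffer` and its small signature table are illustrative assumptions, covering only a few well-known file signatures.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a tiny magic-number table that looks at the first
// bytes of a payload and nothing else.
final class MagicNumberSniffer {
    private static final byte[] PNG =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] GIF87 = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89 = "GIF89a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private MagicNumberSniffer() {}

    /** Returns a MIME type for a recognized signature, or null if unknown. */
    static String detect(byte[] head) {
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, JPEG)) return "image/jpeg";
        if (startsWith(head, GIF87) || startsWith(head, GIF89)) return "image/gif";
        if (startsWith(head, PDF)) return "application/pdf";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because it only compares a fixed prefix, the cost is bounded regardless of how malformed the rest of the file is, which is the predictability property mentioned above.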

kngenie commented 10 years ago

Thank you for the info on Tika. I haven't looked into it deeply enough to know about that behavior. Anyway, this patch introduces a new interface, MimeTypeDetector, that should allow others to try different implementations.
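As a rough picture of what a pluggable detector interface makes possible, the sketch below chains several detectors and returns the first answer. The names `SketchMimeTypeDetector` and `ChainedDetector`, and the `byte[]` parameter, are hypothetical; the real MimeTypeDetector interface in the patch may have a different method name and signature.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a pluggable detector; NOT the signature of the
// MimeTypeDetector interface introduced by the patch.
interface SketchMimeTypeDetector {
    /** Returns a detected MIME type, or null if this detector cannot tell. */
    String detect(byte[] payloadHead);
}

// Tries each delegate in order and returns the first non-null answer.
final class ChainedDetector implements SketchMimeTypeDetector {
    private final List<SketchMimeTypeDetector> delegates;

    ChainedDetector(List<SketchMimeTypeDetector> delegates) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public String detect(byte[] payloadHead) {
        for (SketchMimeTypeDetector d : delegates) {
            String type = d.detect(payloadHead);
            if (type != null) {
                return type;
            }
        }
        return null; // nobody recognized the payload
    }
}
```

A heuristic detector and, say, a Tika-backed one could then be swapped or combined without touching the replay code.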

I read through the WHATWG recommendation quickly, and got the impression that SimpleMimeTypeDetector implements a good subset of it. Something missing from the recommendation is JavaScript / CSS detection "without context", which is critical for Wayback (in fact, binary format detection in SimpleMimeTypeDetector is only there to avoid unnecessary charset conversion).
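To show what JavaScript / CSS detection "without context" might look like, here is a deliberately small heuristic sketch that works only from the decoded start of the payload. It does not reproduce SimpleMimeTypeDetector's actual rules: the class `TextKindGuesser` and both patterns are assumptions, and they will misclassify some inputs.

```java
import java.util.regex.Pattern;

// Illustrative heuristics only; real detection needs many more cases.
final class TextKindGuesser {
    // A few tokens that are common in JavaScript and rare in CSS.
    private static final Pattern JS_HINT =
            Pattern.compile("\\bfunction\\b|\\bvar\\s+[A-Za-z_$]|=>");
    // "selector { property: value;" with bounded repetition, so a
    // non-matching input cannot trigger runaway backtracking.
    private static final Pattern CSS_RULE = Pattern.compile(
            "[^{}]{1,200}\\{\\s*[a-zA-Z-]{1,40}\\s*:[^;{}]{1,200}[;}]");

    private TextKindGuesser() {}

    /** Returns "text/javascript", "text/css", or null when undecided. */
    static String guess(String head) {
        if (head == null) {
            return null;
        }
        String t = head.trim();
        if (t.isEmpty() || t.startsWith("<")) {
            return null; // empty, or looks like markup
        }
        if (t.startsWith("@charset") || t.startsWith("@import")
                || t.startsWith("@media")) {
            return "text/css";
        }
        // JS is checked first because an object literal such as
        // "var x = { a: 1 }" also resembles a CSS rule.
        if (JS_HINT.matcher(t).find()) {
            return "text/javascript";
        }
        if (CSS_RULE.matcher(t).find()) {
            return "text/css";
        }
        return null;
    }
}
```

The ordering matters: checking for CSS first would label many scripts as stylesheets, since braces and colons appear in both.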

kngenie commented 9 years ago

Found a backtracking-explosion problem with the CSS detection regular expression (fixed in b3b6f5a). I consider this changeset still open.
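For readers unfamiliar with this class of bug, the sketch below shows one common cause and one common remedy in java.util.regex. The patterns are made up for illustration and are not the expression fixed in b3b6f5a; the class `SelectorPatterns` is hypothetical.

```java
import java.util.regex.Pattern;

// Two ways to match "one or more selector tokens followed by '{'".
final class SelectorPatterns {
    // Prone to exponential backtracking: when no '{' follows, the nested
    // quantifiers let the engine split a long run of token characters
    // between iterations in exponentially many ways before giving up.
    static final Pattern PRONE =
            Pattern.compile("(?:\\s*[\\w.#-]+\\s*)+\\{");

    // Same intent with possessive quantifiers (*+ and ++): once a run of
    // characters is consumed it is never given back, so failure is fast.
    static final Pattern SAFE =
            Pattern.compile("(?:\\s*+[\\w.#-]++)++\\s*+\\{");

    private SelectorPatterns() {}

    /** True if the text starts with selector tokens followed by '{'. */
    static boolean startsWithSelectorBlock(CharSequence text) {
        return SAFE.matcher(text).lookingAt();
    }
}
```

Both patterns accept an input like `body div.item {`. The difference only shows on long inputs with no opening brace, where the first can take an impractical amount of time and the second fails quickly.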

kngenie commented 9 years ago

Changes so far (up to b3b6f5a) have been merged to iipc/master in 1dbd4f3. The mimetype-detector branch and this issue remain open for further bug fixes / enhancements. ACC-48 is in the queue.