openpreserve / nanite

Nanite - a friendly swarm of format-identifying robots.
openplanets.github.io/nanite/

detection of input streams #2

Closed dmmd closed 11 years ago

dmmd commented 11 years ago

Hey Andy,

The project I am working on involves identifying files inside disk images. The API (https://github.com/sleuthkit/sleuthkit/) allows a file's bytestream to be streamed out of the image. I'm trying to avoid having to buffer a copy of the file to perform the identification, so I added the following method to my local copy of DroidBinarySignatureDetector. Can you foresee other uses for identifying bytestreams that are already being streamed?

String getMimeType(InputStream is, String localPath) throws FileNotFoundException, IOException, ConfigurationException, SignatureFileException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, localPath);
    return this.detect(is, metadata).toString();
}

anjackson commented 11 years ago

I think I missed this ticket - sorry. For reasons I don't understand I wasn't marked as 'watching' this repo.

I also needed InputStream support when parsing web archives, so it has been added to recent versions of Nanite. The DroidDetector class implements the Apache Tika format detection interface, so you can use it to get an extended MIME type from an InputStream (e.g. "application/pdf; version=1.4").
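
To illustrate what an extended MIME type string like the one above carries, here is a minimal sketch that splits it into a base type and its parameters. It uses only the standard library; the `ExtendedMimeType` class and its `parse` method are hypothetical helpers written for this example and are not part of Nanite, DROID, or Tika. The parser is deliberately naive and assumes parameter values contain no semicolons.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nanite or Tika.
final class ExtendedMimeType {
    final String baseType;
    final Map<String, String> parameters;

    private ExtendedMimeType(String baseType, Map<String, String> parameters) {
        this.baseType = baseType;
        this.parameters = parameters;
    }

    /** Splits e.g. "application/pdf; version=1.4" into type and parameters. */
    static ExtendedMimeType parse(String value) {
        String[] parts = value.split(";");
        String base = parts[0].trim().toLowerCase(Locale.ROOT);
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed parameters
            }
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String val = part.substring(eq + 1).trim();
            // Drop surrounding double quotes, if present.
            if (val.length() >= 2 && val.startsWith("\"") && val.endsWith("\"")) {
                val = val.substring(1, val.length() - 1);
            }
            params.put(name, val);
        }
        return new ExtendedMimeType(base, params);
    }

    public static void main(String[] args) {
        ExtendedMimeType t = parse("application/pdf; version=1.4");
        System.out.println(t.baseType + " / version " + t.parameters.get("version"));
    }
}
```

Keeping the version as a separate parameter means code that only cares about the base type can ignore it, while code that needs the exact format version can still read it.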

However, DROID's use of end-of-file signatures usually means the whole bytestream is read. For smaller resources, this fits in the BufferedInputStream, but for larger resources it will fail if DROID attempts an InputStream.reset(), and many long and ridiculous stack traces will ensue (see #3).
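
The mark/reset limitation described above can be reproduced with the standard library alone. The following is a minimal sketch, not Nanite or DROID code: the `MarkResetDemo` class and `canRewind` method are hypothetical names, and the 16-byte buffer and 64-byte payload are arbitrary small values chosen so the effect is easy to see. It marks the start of a BufferedInputStream, reads to the end (as an end-of-file signature check would), and then attempts reset().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo for illustration only; not part of Nanite or DROID.
final class MarkResetDemo {

    /**
     * Marks the start of a buffered stream, reads it to the end, then tries
     * to rewind. Returns true if reset() succeeded, false if the mark was lost.
     */
    static boolean canRewind(byte[] data, int markLimit) {
        // Deliberately tiny initial buffer (16 bytes) to keep the demo small.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        in.mark(markLimit);
        try {
            while (in.read() != -1) {
                // Consume the whole stream.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            // The mark was invalidated because more than markLimit bytes were read.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[64];
        System.out.println("mark limit 1024: " + canRewind(data, 1024));
        System.out.println("mark limit 16:   " + canRewind(data, 16));
    }
}
```

With a mark limit larger than the data, the buffer grows to hold everything and the rewind should succeed; with a limit smaller than the data, the mark is dropped once the buffer overflows and reset() throws an IOException. Raising the mark limit avoids the failure at the cost of holding the whole resource in memory, which is the trade-off the original question was trying to avoid.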