Use Tika's MediaTypes instead of self parsing strings

This is Jukka's answer about this subject:
Hi,

On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <avraham2@gmail.com> wrote:
> How do I identify content types which can't be read as text (in notepad for
> example) because they have some binary content in them.

You can use the media type relationship information stored in
Tika's type registry, like this:

    Tika tika = new Tika();
    MediaType type = MediaType.parse(tika.detect(...));

    MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
    // isInstanceOf(a, b) is true when a equals b or is a specialization of b
    if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
        // process text
    } else {
        // process binary
    }

> [...] if it finds text-parsable content, I want it to take the content as it is

Note that consuming text data can be surprisingly difficult given all
the different character encodings out there. Tika's parser classes
contain quite a bit of logic for automatically figuring out the
correct character encoding and other details needed for correctly
consuming text data.
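
To illustrate the encoding point, here is a minimal sketch in plain Java (standard library only, not Tika code; the class name EncodingPitfall and its decode helper are made up for this illustration). It decodes the same bytes under two charsets and gets two different strings, which is the kind of mistake automatic encoding detection is meant to avoid:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPitfall {

    // Hypothetical helper: interprets the raw bytes using the given charset.
    public static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // "café" written with a Unicode escape so the source file encoding does not matter.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original four characters come back.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Wrong charset: the two-byte UTF-8 sequence for 'é' becomes two separate characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

Nothing in the bytes themselves says which charset was used to write them, so code that reads "text as it is" has to guess or be told.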

What's your reason for wanting to process text data separately? Is
there some missing feature in Tika that would help achieve your use
case without the need for custom processing of text data?

For example the HTML parser supports the IdentityHtmlMapper feature
for skipping the HTML simplification that Tika does by default. To
activate that feature, you can pass an IdentityHtmlMapper instance in
the parse context:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());

--
Jukka Zitting

Original comment by avrah...@gmail.com on 17 Aug 2014 at 5:00
