Indexing PDFs and other binary files

@dsteinkopf It's an interesting idea, though I'm not sure how effective it would turn out to be. The search interface would need some changes, probably a code / doc search switch at minimum.

Here is a minimal proof of concept for PDF that resolves the first stage of accessing the blobs from the filestore and updating the index after a commit. Most of the changes are caused by updated dependencies for Tika.

A few challenges I would anticipate moving forward on this:

1) Server load (CPU and RAM) - PDFs take significant parsing and will usually be much larger than code files.
2) Content error rate - The parsing is error-prone. It's digital content, but not quite as we know it (if you've seen the PDF spec you'll know why ;) ).
3) Parsing failure rate - If the parsing fails, Tika seems to abort rather than skip content it doesn't understand. Maybe that can be changed in some configuration.
4) Rename handling - In some basic testing I noticed that renaming a file causes both the old and the new path to be indexed.
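
To make challenges 1, 3 and 4 a bit more concrete, here is a rough sketch in plain Java of how the indexing step might guard itself. It is not taken from the proof of concept: `BinaryIndexSketch`, `TextExtractor`, `ChangeType` and the size limit are all hypothetical names, the extractor interface just stands in for whatever wraps the Tika call, and a `HashMap` stands in for the real index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch only: none of these names come from Gitblit or Tika. */
final class BinaryIndexSketch {

    /** Hypothetical stand-in for a real text extractor (e.g. a Tika wrapper). */
    interface TextExtractor {
        String extract(byte[] blob) throws Exception;
    }

    /** Hypothetical change types, loosely modelled on what a commit diff reports. */
    enum ChangeType { ADD, MODIFY, DELETE, RENAME }

    // Stand-in for the real search index: path -> extracted text.
    private final Map<String, String> index = new HashMap<>();
    private final TextExtractor extractor;
    private final int maxBlobBytes;

    BinaryIndexSketch(TextExtractor extractor, int maxBlobBytes) {
        this.extractor = extractor;
        this.maxBlobBytes = maxBlobBytes;
    }

    /** Returns the extracted text, or empty if the blob is too large or parsing fails. */
    Optional<String> safeExtract(byte[] blob) {
        if (blob.length > maxBlobBytes) {
            return Optional.empty(); // challenge 1: cap CPU/RAM spent on huge files
        }
        try {
            return Optional.ofNullable(extractor.extract(blob));
        } catch (Exception e) {
            return Optional.empty(); // challenge 3: skip this file, keep indexing the rest
        }
    }

    /** Applies one changed path from a commit to the index. */
    void apply(ChangeType type, String oldPath, String newPath, byte[] blob) {
        // Challenge 4: a rename (or delete) must drop the old path,
        // otherwise both the old and the new entry stay searchable.
        if (type == ChangeType.DELETE || type == ChangeType.RENAME) {
            index.remove(oldPath);
        }
        if (type == ChangeType.DELETE) {
            return;
        }
        Optional<String> text = safeExtract(blob);
        if (text.isPresent()) {
            index.put(newPath, text.get());
        } else {
            index.remove(newPath); // don't keep stale text for a file we can't parse now
        }
    }

    boolean contains(String path) {
        return index.containsKey(path);
    }

    int size() {
        return index.size();
    }
}
```

With something like this, a rename would be applied as `apply(ChangeType.RENAME, "old.pdf", "new.pdf", blob)`, leaving only `new.pdf` in the index. A parse failure in one file would only leave that file unindexed instead of aborting the whole commit. Whether Tika itself can be configured to skip bad content is a separate question that this sketch doesn't answer.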

You're welcome to take this as a starting point if you'd like to work on it further for a potential PR. :)

gitblit-org / gitblit

Indexing PDFs and other binary files #1026