gitblit-org / gitblit

pure java git solution
http://gitblit.com
Apache License 2.0

Indexing PDFs and other binary files #1026

Open dsteinkopf opened 8 years ago

dsteinkopf commented 8 years ago

Hello,

have you ever thought about adding content from PDF files and other binary files to the Lucene search index? I think using a library like Apache Tika could make this not too difficult.

BTW, is there any reason why the file names themselves are not indexed?

Background: I am thinking about using git/gitblit as a document archive for PDFs and having a full text search index would be great.

Any thoughts?

paulsputer commented 8 years ago

@dsteinkopf It's an interesting idea, though I'm not sure how effective it would turn out to be. The search interface would need some changes, probably a code/doc search switch at minimum.

Here is a minimal proof of concept for PDF that covers the first stage: accessing the blobs from the filestore and updating the index after a commit. Most of the changes are caused by updated dependencies for Tika.

A few challenges I would anticipate moving forward on this:

1) Server load (CPU and RAM) - PDFs take significant parsing and will usually be much larger than code files.
2) Content error rate - The parsing is error-prone; it's digital content, but not quite as we know it (if you've seen the PDF spec you'll know why ;) ).
3) Parsing failure rate - If the parsing fails, Tika seems to abort rather than skip content it doesn't understand; maybe that's controlled by some configuration.
4) Renames - Just in some basic testing I noticed that renaming a file causes both the old and the new path to be indexed.
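To make challenges 1, 3 and 4 a bit more concrete, here is a rough sketch of how the indexing step could guard against them. It is not the proof of concept and not Gitblit code: `TextExtractor`, `Change` and `DocumentIndexer` are hypothetical names, the extractor interface only stands in for whatever Tika call ends up being used, and a plain map stands in for the Lucene index. The assumptions are that oversized blobs are simply skipped, that a parse failure skips only that one file, and that a rename arrives with both its old and new path so the old entry can be dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Tika-backed parser; not a Tika or Gitblit type. */
interface TextExtractor {
    String extract(byte[] blob) throws Exception;
}

/** One path touched by a commit; oldPath is null unless the file was renamed. */
final class Change {
    final String oldPath;
    final String newPath;
    final byte[] blob;

    Change(String oldPath, String newPath, byte[] blob) {
        this.oldPath = oldPath;
        this.newPath = newPath;
        this.blob = blob;
    }
}

/** Toy indexer: a map keyed by path plays the role of the search index. */
final class DocumentIndexer {
    private final Map<String, String> index = new LinkedHashMap<>();
    private final List<String> skipped = new ArrayList<>();
    private final TextExtractor extractor;
    private final int maxBlobBytes;

    DocumentIndexer(TextExtractor extractor, int maxBlobBytes) {
        this.extractor = extractor;
        this.maxBlobBytes = maxBlobBytes;
    }

    void apply(List<Change> changes) {
        for (Change c : changes) {
            // Challenge 4: a rename must remove the entry for the old path,
            // otherwise both the old and the new path stay searchable.
            if (c.oldPath != null) {
                index.remove(c.oldPath);
            }
            String text = null;
            // Challenge 1: cap the blob size to bound CPU and RAM per file.
            if (c.blob.length <= maxBlobBytes) {
                try {
                    text = extractor.extract(c.blob);
                } catch (Exception e) {
                    // Challenge 3: one unparseable file must not abort the run.
                    text = null;
                }
            }
            if (text == null) {
                // Drop any stale content for this path and record the skip.
                index.remove(c.newPath);
                skipped.add(c.newPath);
            } else {
                index.put(c.newPath, text);
            }
        }
    }

    Map<String, String> index() {
        return index;
    }

    List<String> skipped() {
        return skipped;
    }
}
```

In a real integration the extractor would presumably wrap something like Tika's `parseToString`, and the map operations would become delete and update calls on the Lucene `IndexWriter` keyed by a path term, but that wiring is untested here.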

You're welcome to take this as a starting point if you'd like to work on it further for a potential PR. :)