dsteinkopf opened 8 years ago
@dsteinkopf It's an interesting idea, though I'm not sure how effective it would turn out to be. The search interface would probably need some changes, a code/doc search switch at minimum.
Here is a minimal proof of concept for PDF that resolves the first stage: accessing the blobs from the filestore and updating the index after a commit. Most of the changes come from updated dependencies for Tika.
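For anyone following along, the core of that stage can be sketched roughly like this: extract plain text from a blob with the Tika facade and build a Lucene document from it. The class name, field names ("path", "content"), and the fallback policy are my own assumptions, not gitblit's actual index schema.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.tika.Tika;

import java.io.InputStream;

// Hypothetical sketch, not the actual PoC code.
public class BlobIndexer {
    private final Tika tika = new Tika();

    public Document toDocument(String path, InputStream blob) throws Exception {
        // Tika auto-detects the content type (PDF, DOCX, ...) and
        // returns the extracted plain text.
        String text = tika.parseToString(blob);

        Document doc = new Document();
        // Store the path so search hits can link back to the file;
        // indexing it as a single token also allows exact filename lookups.
        doc.add(new StringField("path", path, Field.Store.YES));
        // Index the extracted text for full-text search; no need to store it.
        doc.add(new TextField("content", text, Field.Store.NO));
        return doc;
    }
}
```

The returned document would then be handed to whatever `IndexWriter` the commit hook already uses.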
A few challenges I would anticipate moving forward on this:
1) Server load (CPU and RAM) - parsing PDFs takes significant work, and they will usually be much larger than code files.
2) Content error rate - the parsing is error prone; it's digital content, but not quite as we know it (if you've seen the PDF spec you'll know why ;) ).
3) Parsing failure rate - if parsing fails, Tika seems to abort rather than skip the content it doesn't understand; maybe that's configurable.
4) In some basic testing I noticed that renaming a file causes both the old and new names to be indexed.
You're welcome to take this as a starting point if you'd like to work on it further for a potential PR. :)
Hello,
have you ever thought about adding content from PDF files and other binary files to the Lucene search index? I think using a library like Apache Tika could make this not too difficult.
BTW, is there any reason why the file names themselves are not indexed?
Background: I am thinking about using git/gitblit as a document archive for PDFs, and having a full-text search index would be great.
Any thoughts?