tspannhw / nifi-extracttext-processor

Apache NiFi Custom Processor Extracting Text From Files with Apache Tika
Apache License 2.0
35 stars, 29 forks

ZeroByteFileException from Tika? #1

Closed willydlilly closed 6 years ago

willydlilly commented 6 years ago

@tspannhw When using this code, I am able to get the unit tests to work just fine and return data after the enqueue/run methods are called. But once I deploy to NiFi, I keep getting this Tika ZeroByteFileException message: "InputStream must have > 0 bytes." This is after sending in the same PDF file used for the unit tests. I can't seem to find any information about this...

I have confirmed from a post by Brian Bende that the NAR packages up all required libraries, and I have even unzipped the NAR to verify that the Tika libraries were included. NiFi starts up fine, so I really don't think it's a missing-library issue. The processor is accessible in the NiFi UI and can be configured. It just doesn't seem to get the input properly.

Were there any additional installation tasks for your processor other than dropping the NAR in the /nifi/lib/ dir? I think Tika does allow custom configurations through XML files. Did you have to specify a custom config at all? I can't seem to make any sense of this exception and figure it must be an install issue. Any thoughts?

I'm using NiFi 1.5.0, Tika 1.17, JDK 8. I also have PDFBox 2.0.8 there.

*Note: I also have a simple PDFBox-based custom processor hooked up in parallel in the NiFi flow. This processor gets the PDF input file, reads it just fine, and parses the output. So I suppose that eliminates any potential issue with NiFi not "delivering" the input file as a Java IO InputStream properly.

tspannhw commented 6 years ago

See my article and example here: https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac.html

Try this setup https://community.hortonworks.com/storage/attachments/56409-tika.xml

https://community.hortonworks.com/articles/81694/extracttext-nifi-custom-processor-powered-by-apach.html

If this is not working, post a question on the community site with a screenshot and any error messages.

Which version of NiFi are you using? Also, you need to use JDK 8.

No custom configuration is needed.

willydlilly commented 6 years ago

Thanks for the lightning-fast response. I didn't expect that! I will look into the links you sent. I also updated the initial post with version info.

willydlilly commented 6 years ago

Guy downloads your code. Guy modifies your code. Guy blames you for broken code...

Same story, different day. :)

I thought it would be useful to not just convert the input doc to text, but also to store the original file's MIME type as an attribute. As such, I added a line: mimeType = tika.detect(inputStream, filename);

The Tika documentation says that the detector will mark/reset the stream, but this does NOT seem to happen. After a ton of searching around, I stumbled on this thread, which highlights the problem: the Tika detector seems to be consuming the input stream. So in the future, if you decide to run more than one Tika operation on the same inputStream, consider wrapping it in a BufferedInputStream first. This did correct my issue.

BufferedInputStream buffStream = new BufferedInputStream(inputStream);
...
mimeType = tika.detect(buffStream, filename);
text = tika.parseToString(buffStream);
...
buffStream.close();
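
For anyone who lands here later, a possible explanation (my reading, not something confirmed in this thread): as far as I can tell, Tika's Javadoc for detect only promises the mark/reset when the stream supports the mark feature, and a raw stream may not. The sketch below reproduces that effect with the JDK alone. The sniff helper is a hypothetical stand-in for a detector and is not Tika's implementation; withoutMarkSupport, remainingAfterSniff, and the class name are likewise made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class MarkResetDemo {

    /** Wraps a stream so that it reports no mark/reset support, as some raw streams do. */
    static InputStream withoutMarkSupport(InputStream in) {
        return new FilterInputStream(in) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readLimit) { /* intentionally a no-op */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Hypothetical detector stand-in: peeks at the head of the stream, rewinding only if it can. */
    static String sniff(InputStream in, int peekBytes) throws IOException {
        boolean canRewind = in.markSupported();
        if (canRewind) {
            in.mark(peekBytes);
        }
        byte[] head = new byte[peekBytes];
        int n = in.read(head, 0, peekBytes);
        if (canRewind) {
            in.reset();
        }
        return n <= 0 ? "" : new String(head, 0, n, StandardCharsets.US_ASCII);
    }

    /** Sniffs 4 bytes, then counts how many bytes a later reader (the "parser") would still see. */
    static int remainingAfterSniff(InputStream in) {
        try {
            sniff(in, 4);
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory streams only, so nothing needs closing here.
        byte[] doc = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);

        InputStream raw = withoutMarkSupport(new ByteArrayInputStream(doc));
        System.out.println("raw stream, bytes left: " + remainingAfterSniff(raw));

        InputStream buffered =
                new BufferedInputStream(withoutMarkSupport(new ByteArrayInputStream(doc)));
        System.out.println("buffered stream, bytes left: " + remainingAfterSniff(buffered));
    }
}
```

With the 15-byte sample, the raw stream has 11 bytes left after the sniff, while the buffered one still has all 15, because BufferedInputStream supplies its own mark/reset on top of a stream that has none. In a real processor, try-with-resources around the BufferedInputStream would take care of the close() call.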

tspannhw commented 6 years ago

Good to know. Feel free to fork it and add some goodies. I was going to update to the latest version of Apache Tika and see what else could be added.

tspannhw commented 6 years ago

https://community.hortonworks.com/content/kbentry/177370/extracting-html-from-pdf-excel-and-word-documents.html