sleuthkit / autopsy

Autopsy® is a digital forensics platform and graphical interface to The Sleuth Kit® and other digital forensics tools. It can be used by law enforcement, military, and corporate examiners to investigate what happened on a computer. You can even use it to recover photos from your camera's memory card.
http://www.sleuthkit.org/autopsy/

Error on module KeywordSearch #2441

Open peyobr opened 7 years ago

peyobr commented 7 years ago

Hi, I am successfully running Autopsy 4.2.0 on my Linux Mint 18 workstation (Intel Core i7, 16 GB RAM).

When I start ingesting, I always get errors on .docx and .xlsx files, such as the following:

Is it normal?

Dec 22, 2016 11:59:29 AM org.sleuthkit.autopsy.keywordsearch.TikaTextExtractor index
WARNING: Exception: Unable to read Tika content stream from 414089: MyFile.docx
java.io.IOException:
    at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:260)
    at org.sleuthkit.autopsy.keywordsearch.TikaTextExtractor.index(TikaTextExtractor.java:165)
    at org.sleuthkit.autopsy.keywordsearch.KeywordSearchIngestModule$Indexer.extractTextAndIndex(KeywordSearchIngestModule.java:435)
    at org.sleuthkit.autopsy.keywordsearch.KeywordSearchIngestModule$Indexer.indexFile(KeywordSearchIngestModule.java:563)
    at org.sleuthkit.autopsy.keywordsearch.KeywordSearchIngestModule$Indexer.access$100(KeywordSearchIngestModule.java:400)
    at org.sleuthkit.autopsy.keywordsearch.KeywordSearchIngestModule.process(KeywordSearchIngestModule.java:255)
    at org.sleuthkit.autopsy.ingest.FileIngestPipeline$PipelineModule.process(FileIngestPipeline.java:219)
    at org.sleuthkit.autopsy.ingest.FileIngestPipeline.process(FileIngestPipeline.java:125)
    at org.sleuthkit.autopsy.ingest.DataSourceIngestJob.process(DataSourceIngestJob.java:769)
    at org.sleuthkit.autopsy.ingest.FileIngestTask.execute(FileIngestTask.java:44)
    at org.sleuthkit.autopsy.ingest.IngestManager$ExecuteIngestJobsTask.run(IngestManager.java:1016)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: org.apache.poi.xwpf.usermodel.XWPFParagraph.getIRuns()Ljava/util/List;
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:205)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaderText(XWPFWordExtractorDecorator.java:382)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractHeaders(XWPFWordExtractorDecorator.java:366)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:82)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:105)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:221)
    ... 1 more

peyobr commented 7 years ago

To be honest, it's only a WARNING...

but I am still wondering whether Tika can index .docx and .xlsx files at all, or whether something is wrong with my build.

Thank you!

rcordovano commented 7 years ago

This is a normal warning. We have debated whether to log the warning at all, since it can be alarming and can happen in various circumstances: the file content can be corrupted, in a new format, etc. Every time this has come up, we have decided to keep the warnings for the sake of the advanced user who wants to comb through the logs to get a detailed picture of what happened during analysis by the ingest modules.

Autopsy uses Tika as one of several ways to extract text to send to Solr for indexing. If you see this warning, then Tika has indicated it can extract text from the specified file type, so Autopsy has selected Tika as the text extractor for the given file, but Tika has been unable to parse it. When this happens, Autopsy falls back to extracting strings from the file.
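
The fallback described above can be pictured with a minimal, self-contained sketch. This is not Autopsy's or Tika's actual code: the `TextExtractor` interface, `FAILING_PARSER`, `extractStrings`, and `extractText` are hypothetical names invented for illustration, and the strings fallback here simply collects runs of printable ASCII, which is an assumption about what "extracting strings" means in its simplest form.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ExtractorFallbackSketch {

    /** Hypothetical stand-in for a text extractor; not an Autopsy or Tika type. */
    interface TextExtractor {
        String extract(byte[] content) throws IOException;
    }

    /** Simulates a parser that claims to support the file type but fails to parse it. */
    static final TextExtractor FAILING_PARSER = content -> {
        throw new IOException("Unable to read content stream");
    };

    /** Fallback: collect runs of at least minLen printable ASCII characters. */
    static String extractStrings(byte[] content, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : content) {
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return String.join("\n", runs);
    }

    /** Try the preferred extractor; on failure, log a warning and fall back to strings. */
    static String extractText(TextExtractor preferred, byte[] content) {
        try {
            return preferred.extract(content);
        } catch (IOException e) {
            System.err.println("WARNING: " + e.getMessage() + "; falling back to strings");
            return extractStrings(content, 4);
        }
    }

    public static void main(String[] args) {
        // Mixed binary and text bytes, loosely imitating a file the parser cannot read.
        byte[] content = {'P', 'K', 3, 4, 'h', 'e', 'l', 'l', 'o', 0, 'w', 'o', 'r', 'l', 'd', 1, 'a', 'b'};
        System.out.println(extractText(FAILING_PARSER, content));
    }
}
```

In this sketch the warning is logged and the file still yields searchable text, which mirrors why the warning in the log does not by itself mean the file was skipped.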

Richard Cordovano Autopsy/Autopsy Customization Team Lead Basis Technology
