[OC6 Search Lucene] does not search PDF file

butonic commented 10 years ago

Issue by ser72 from Monday Jan 06, 2014 at 13:32 GMT Originally opened as https://github.com/owncloud/apps/issues/1591

Expected Results: Find text in a PDF file.

Actual Results: Unable to find (pinpoint) the file.

Steps:

1. Load OC6 (in this instance, OC6 Enterprise Daily Build 1/6/14).
2. Create a new file as a link to a PDF (the file is then uploaded to ownCloud): http://doc.owncloud.com/server/5.0EE/ownCloudAdminManual.pdf
3. Search for "developers", for instance (it appears on page 1 of the doc). Nothing comes up.
4. Add a text file and search for something in it. The text file is found.

The App states: "We currenty support plain text, HTML and PDF files. MS Office 2007 and Open/Libre Office are on the roadmap. "

So PDF should work

{"installed":"true","version":"6.90.0.1","versionstring":"7.0 pre alpha","edition":"enterprise"}

Ubuntu PHP 5.4.23

butonic commented 10 years ago

Comment by karlitschek from Monday Jan 06, 2014 at 13:56 GMT

@butonic What do you think?

butonic commented 10 years ago

Comment by butonic from Monday Jan 06, 2014 at 14:08 GMT

I can reproduce it with ownCloudAdminManual.pdf. The log shows:

{"app":"search_lucene","message":"Cross-reference streams are not supported yet. Trace:\\n
#0 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf\/Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('760683')\n#1 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(318): Zend_Pdf_Parser->__construct('%PDF-1.5?%?????...', Object(Zend_Pdf_ElementFactory_Proxy), false)\n
#2 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(255): Zend_Pdf->__construct('%PDF-1.5?%?????...', NULL)\n
#3 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(174): Zend_Pdf::parse('%PDF-1.5?%?????...')\n
#4 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(81): OCA\\Search_Lucene\\Indexer::extractMetadata(Object(Zend_Search_Lucene_Document), '\/ownCloudAdminM...', Object(OC\\Files\\View), 'application\/pdf')\n
#5 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(45): OCA\\Search_Lucene\\Indexer::indexFile('\/ownCloudAdminM...', 'admin')\n
#6 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(77): index()\n
#7 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(700): require_once('\/home\/jfd\/Repos...')\n
#8 [internal function]: OC::loadAppScriptFile(Array)\n
#9 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/router.php(127): call_user_func(Array, Array)\n
#10 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(629): OC_Router->match('\/apps\/search_lu...')\n
#11 \/home\/jfd\/Repositories\/oc\/core-stable5\/index.php(28): OC::handleRequest()\n
#12 {main}","level":3,"time":"2014-01-06T14:02:42+00:00"}
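
The message points at the PDF rather than at ownCloud: the header in the trace is %PDF-1.5, and PDF 1.5 allows the cross-reference data to be stored as a stream object instead of a classic xref table, which the bundled Zend_Pdf parser rejects. A minimal sketch for checking which kind a given file uses, assuming a hypothetical helper named usesXrefStream (it is not part of search_lucene or Zend_Pdf) and looking only at the last startxref entry:

```php
<?php
/**
 * Hypothetical helper, not part of search_lucene or Zend_Pdf.
 * Returns true for a cross-reference stream, false for a classic
 * "xref" table, and null when no usable startxref entry is found.
 */
function usesXrefStream(string $data): ?bool
{
    // The offset of the last cross-reference section follows the final "startxref".
    $pos = strrpos($data, 'startxref');
    if ($pos === false
        || !preg_match('/^startxref\s+(\d+)/', substr($data, $pos), $m)) {
        return null;
    }

    $offset = (int) $m[1];
    if ($offset >= strlen($data)) {
        return null;
    }

    // A classic table begins with the keyword "xref"; a stream is an
    // indirect object and begins with "<number> <generation> obj".
    $head = ltrim(substr($data, $offset, 64));
    if (strncmp($head, 'xref', 4) === 0) {
        return false;
    }

    return preg_match('/^\d+\s+\d+\s+obj\b/', $head) === 1 ? true : null;
}
```

Calling usesXrefStream(file_get_contents($path)) on a file that fails to index would show whether this particular parser limitation is the likely cause or whether something else is going wrong.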

Closing as duplicate of https://github.com/owncloud/core/issues/6641

butonic commented 10 years ago

Comment by butonic from Monday Jan 06, 2014 at 14:09 GMT

Well, reopening here and closing in core since it's an app.

butonic commented 10 years ago

Comment by DeepDiver1975 from Monday Jan 06, 2014 at 19:18 GMT

@butonic which version of zend_pdf is used? Please consider using composer:

    "zendframework/zendpdf": "2.*",

butonic commented 10 years ago

Comment by jfreak53 from Wednesday Jan 08, 2014 at 15:54 GMT

My PDF version is 1.4 and it still does not work: http://i.imgur.com/HVWy2zd.png

butonic commented 10 years ago

Comment by butonic from Wednesday Jan 08, 2014 at 16:37 GMT

@jfreak53 can I download the PDF somewhere? Or does it contain sensitive data?

butonic commented 10 years ago

Comment by jfreak53 from Wednesday Jan 08, 2014 at 16:43 GMT

Not at all, go for it please: https://cloud.microtronix-tech.com/public.php?service=files&t=c634a292a95458dbac16892afefca1f2

It's probably something stupid on my end or with the PDF file itself if it's the right version. It's a converted HTML page using wkhtmltopdf.

butonic commented 10 years ago

Comment by butonic from Wednesday Jan 08, 2014 at 17:23 GMT

TL;DR: try searching for 'term*'

long version:

searching via full text: http://localhost/core-stable5/index.php/search/ajax/search.php?query=javascript gives me

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]

searching via wildcard: http://localhost/core-stable5/index.php/search/ajax/search.php?query=java* gives me

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]

Lucene does not, by default, use the search term the way an SQL LIKE '%term%' statement would. The drawback is that this changes what has to be put in the search field. For performance reasons, I decided to use the Lucene defaults and not allow searching for partial strings or simulating the core search. You can enable searching for partial terms by uncommenting https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L225 and prepending the line with a \ so it looks like:

\Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);

searching via wildcards: http://localhost/core-stable5/index.php/search/ajax/search.php?query=*ava* then gives

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]

If you feel performance is not a problem, you can automatically add the wildcards by uncommenting lines 217 or 218-221, depending on what kind of search terms you want to use automagically.

Another reason I commented out those lines is that adding wildcards will not allow using the more sophisticated query features like "excel AND javascript" or "AUTHOR:dreyer AND TITLE:something". @jancborchardt UX input might be nice here on how to approach the topic of what kind of search to allow. We might even write our own query parser to implement custom search behavior.
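
One possible middle ground, sketched here under assumptions rather than taken from the app: append a trailing wildcard only to plain queries, and pass anything that already uses query syntax through untouched. The function name addTrailingWildcards and the default minimum prefix of three characters are hypothetical choices for illustration, and this is not what the commented-out lines in lucene.php do.

```php
<?php
/**
 * Hypothetical helper, not part of search_lucene.
 * Appends "*" to each plain term so that "javascr" also matches
 * "javascript", but leaves advanced queries exactly as typed.
 */
function addTrailingWildcards(string $query, int $minPrefix = 3): string
{
    // Wildcards, field prefixes, phrases, grouping or boolean operators
    // mean the user is writing query syntax deliberately.
    if (preg_match('/[*?:"()]|\b(AND|OR|NOT)\b/', $query)) {
        return $query;
    }

    $terms = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as &$term) {
        // Terms shorter than the minimum prefix are left as exact matches.
        if (strlen($term) >= $minPrefix) {
            $term .= '*';
        }
    }
    unset($term); // break the reference left over from the loop

    return implode(' ', $terms);
}
```

With this sketch, "javascr" becomes "javascr*", while "excel AND javascript" and "AUTHOR:dreyer AND TITLE:something" are returned unchanged. The operator check is case-sensitive, so a lowercase "and" is treated as an ordinary term.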

hope this helps.

butonic commented 10 years ago

Comment by jfreak53 from Wednesday Jan 08, 2014 at 17:38 GMT

Still not working for me: https://cloud.microtronix-tech.com/index.php/search/ajax/search.php?query=exam*

Returns: []

Is something missing on my server? The searched term should return the document as the text example is in the PDF.

Of course java* works as that's the name of the file. But parsing text isn't working in the PDF.

butonic commented 10 years ago

Comment by jancborchardt from Thursday Jan 09, 2014 at 10:58 GMT

Hm, UX-wise searching for partials of words should just work, without using the asterisk* indicator. If I search for "javascr", stuff about javascript should turn up as well of course. Or am I not getting the issue here?

butonic commented 10 years ago

Comment by butonic from Thursday Jan 09, 2014 at 12:02 GMT

Ok, currently we start searching when the user has entered at least three characters. (we still highlight the currently visible files containing the search term < 3 chars). How do we tell that to the user? With a placeholder "Type at least 3 characters to start searching"? Not good enough, IMHO. Or do we show a tipsy if less than three characters are in the search field and it has the focus? Better ... @jancborchardt how should it be done?

butonic commented 10 years ago

Comment by butonic from Thursday Jan 09, 2014 at 12:03 GMT

@jfreak53 my bad. Seems the body of the pdf is garbled. Investigating.

butonic commented 10 years ago

Comment by jancborchardt from Thursday Jan 09, 2014 at 12:09 GMT

@butonic first off, it should start searching with 2 characters, not 3. That would already alleviate the problem since probably no one expects the search to work with just one character.

butonic commented 10 years ago

Comment by butonic from Thursday Jan 09, 2014 at 12:11 GMT

@jancborchardt hm, I guess that depends on the performance. I need to find the time for a thorough analysis.

butonic commented 10 years ago

Comment by jfreak53 from Thursday Jan 09, 2014 at 13:55 GMT

Garbled! Aha! What did the pdf converter do to it I wonder hmm.

I think 3 characters is just fine, less than that and if there are a lot of files it finds a lot of non matching stuff. Or, leave an option in the admin :) dropdown style ha ha. Leave it up to the administrator of each unit, easy peasy :)

butonic commented 10 years ago

Comment by jancborchardt from Friday Jan 10, 2014 at 16:31 GMT

@jfreak53 we won’t introduce options for minute things like these. ;)

butonic commented 10 years ago

Comment by jfreak53 from Tuesday Feb 11, 2014 at 13:26 GMT

I wanted to check on this, I've upgraded to the most recent version of OC recently but still it's not indexing the text. Any ideas on this one?

mikeazo commented 10 years ago

I was having a similar issue where PDF files were not found in search results, although txt and docx files were. As a temporary fix, I added some code to document/Pdf.php: if Zend_Pdf::parse throws an exception, it simply calls $pdfParse->pdf2txt($data) to get the text out of the PDF and adds that as the 'body' field instead (no other fields are added). That appears to be working for me.

I end up missing the meta-data, but at least I can search on the text.

jfreak53 commented 10 years ago

Could you post that code you used? Maybe this will work for me as a temp fix.

mikeazo commented 10 years ago

Instead of creating a pull request, I'll just put the patch inline below. Basically I created a variable to determine whether or not the original code worked. If it didn't, I just pass $data into pdf2txt. The code could be cleaned up quite a bit, but hopefully this will do for a quick fix.

The thing I don't know how to do yet is force ownCloud to reindex already uploaded files. Any updated files will be reindexed the next time cron.php is run.

--- Pdf.php.orig    2014-02-26 14:25:43.000000000 -0500
+++ Pdf.php 2014-02-26 14:26:38.000000000 -0500
@@ -16,7 +16,7 @@
      * @param boolean $storeContent
      */
     private function __construct($data, $storeContent) {
-
+        $done = false;
        try {
            $zendpdf = \Zend_Pdf::parse($data);

@@ -48,6 +48,7 @@
                    $this->addField(\Zend_Search_Lucene_Field::UnStored('body', $body, 'UTF-8'));
                }
            }
+            $done = true;

        } catch (\Exception $e) {
            Util::writeLog('search_lucene',
@@ -55,6 +56,29 @@
                Util::ERROR);
        }

+        try {
+            if (!$done) {
+                // Zend failed, do something much simpler
+               $pdfParse = new \App_Search_Helper_PdfParser();
+                $body = $pdfParse->pdf2txt($data); 
+                
+                if (!empty($body)) {
+                    if ($storeContent) {
+                        $this->addField(\Zend_Search_Lucene_Field::Text('body', $body, 'UTF-8'));
+                    } else {
+                        $this->addField(\Zend_Search_Lucene_Field::UnStored('body', $body, 'UTF-8'));
+                    }
+                } else {
+                    Util::writeLog('search_lucene',
+                        ' Trace:\nNothing returned by pdf2txt',
+                        Util::ERROR);
+                } 
+            }
+        } catch (\Exception $e) {
+            Util::writeLog('search_lucene',
+                $e->getMessage() . ' Trace:\n' . $e->getTraceAsString(),
+                Util::ERROR);
+        }
     }

     /**

butonic commented 10 years ago

v0.6.0 uses a different parser

eslindsey commented 9 years ago

@butonic I am having this problem with some PDF files generated by a Fujitsu ScanSnap ix500. I am currently dealing with a volume of about 50-200 signed credit card slips per day, which I get into ownCloud by dropping them via a mapped drive directly into the files directory from Windows to Linux. The problem is that the full contents don't seem to be getting indexed (sometimes I can find the first few words in the document, but not always, and I can never find anything based on words appearing later in the document). We are talking about text that is about 50 lines per document. If I run pdftotext and add txt files alongside the pdf files, they get indexed and returned in search results properly. How can I help troubleshoot this issue so I don't have to store 2 copies of each file (1 txt, 1 pdf)?

Update: SOME of the text that I am looking for IS working now. Still, I have an identical set of PDF files and TXT files side by side, and typing in one word gives me many more results for the TXT documents than it does for the PDF documents.
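
Until the indexing gap is understood, the sidecar workaround described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, it assumes a Unix-like shell with the pdftotext command-line tool on the PATH, and it does nothing to make ownCloud notice or index the new txt files.

```php
<?php
// Hypothetical helpers for illustration; none of this is part of ownCloud.

/** Path of the txt sidecar that sits next to a given PDF. */
function sidecarPath(string $pdf): string
{
    return preg_replace('/\.pdf$/i', '.txt', $pdf);
}

/** Shell command that writes the text of $pdf into $txt. */
function pdftotextCommand(string $pdf, string $txt): string
{
    // pdftotext takes the input PDF followed by the output text file.
    return 'pdftotext ' . escapeshellarg($pdf) . ' ' . escapeshellarg($txt);
}

/** Converts every *.pdf in $dir that lacks an up-to-date sidecar. */
function convertDirectory(string $dir): int
{
    $converted = 0;
    foreach (glob(rtrim($dir, '/') . '/*.pdf') ?: [] as $pdf) {
        $txt = sidecarPath($pdf);
        // Skip PDFs whose sidecar is already at least as new as the PDF.
        if (is_file($txt) && filemtime($txt) >= filemtime($pdf)) {
            continue;
        }
        exec(pdftotextCommand($pdf, $txt), $output, $status);
        if ($status === 0) {
            $converted++;
        }
    }
    return $converted;
}
```

Running convertDirectory() on the drop folder from a scheduled job would keep the txt copies current. Comparing a sidecar with the search results for its PDF then gives a concrete list of words that pdftotext extracts but the index does not contain.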

owncloud-archive / search_lucene

[OC6 Search Lucene] does not search PDF file #14