[OC6 Search Lucene] does not search PDF file

ser72 commented 10 years ago

Expected Results: Find text in PDF file

Actual Results: Unable to find pinpoint file

Steps: Load OC6 (in this instance OC6 Enterprise Daily Build 1/6/14). Create a new file as a link to a PDF -- http://doc.owncloud.com/server/5.0EE/ownCloudAdminManual.pdf

The file is then uploaded to ownCloud

Search for "developers" for instance (appears on page 1 of the doc).

Nothing comes up.

Add a text file and search for something in it; the text file is found.

The App states: "We currenty support plain text, HTML and PDF files. MS Office 2007 and Open/Libre Office are on the roadmap. "

So PDF should work

{"installed":"true","version":"6.90.0.1","versionstring":"7.0 pre alpha","edition":"enterprise"}

Ubuntu PHP 5.4.23

karlitschek commented 10 years ago

@butonic What do you think?

butonic commented 10 years ago

Can reproduce it with the ownCloudAdminManual.pdf. Log shows

{"app":"search_lucene","message":"Cross-reference streams are not supported yet. Trace:\\n
#0 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf\/Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('760683')\n#1 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(318): Zend_Pdf_Parser->__construct('%PDF-1.5?%?????...', Object(Zend_Pdf_ElementFactory_Proxy), false)\n
#2 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(255): Zend_Pdf->__construct('%PDF-1.5?%?????...', NULL)\n
#3 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(174): Zend_Pdf::parse('%PDF-1.5?%?????...')\n
#4 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(81): OCA\\Search_Lucene\\Indexer::extractMetadata(Object(Zend_Search_Lucene_Document), '\/ownCloudAdminM...', Object(OC\\Files\\View), 'application\/pdf')\n
#5 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(45): OCA\\Search_Lucene\\Indexer::indexFile('\/ownCloudAdminM...', 'admin')\n
#6 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(77): index()\n
#7 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(700): require_once('\/home\/jfd\/Repos...')\n
#8 [internal function]: OC::loadAppScriptFile(Array)\n
#9 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/router.php(127): call_user_func(Array, Array)\n
#10 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(629): OC_Router->match('\/apps\/search_lu...')\n
#11 \/home\/jfd\/Repositories\/oc\/core-stable5\/index.php(28): OC::handleRequest()\n
#12 {main}","level":3,"time":"2014-01-06T14:02:42+00:00"}
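For anyone who wants to check whether their own PDF would hit the same error: the message suggests the parser followed the `startxref` offset and found a cross-reference stream (introduced with PDF 1.5) rather than a classic `xref` table. Below is a minimal sketch in Python of that check. The function name `xref_kind` is made up for illustration and is not part of Zend_Pdf or ownCloud, and the sketch ignores hybrid files that carry both forms.

```python
import re

def xref_kind(pdf_bytes: bytes) -> str:
    """Classify the cross-reference section that the last 'startxref' points to.

    Hypothetical helper for illustration only; returns 'table', 'stream' or 'unknown'.
    """
    # The last 'startxref' keyword is followed by the byte offset of the xref section.
    matches = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    if not matches:
        return "unknown"
    offset = int(matches[-1])

    # A classic table starts with the keyword 'xref'. A cross-reference stream
    # is an ordinary indirect object instead, e.g. "12 0 obj << /Type /XRef ...".
    head = pdf_bytes[offset:offset + 64].lstrip()
    if head.startswith(b"xref"):
        return "table"
    if re.match(rb"\d+\s+\d+\s+obj", head):
        return "stream"
    return "unknown"
```

Calling it as `xref_kind(open("some.pdf", "rb").read())` and getting `"stream"` would be consistent with the "Cross-reference streams are not supported yet" message above, assuming the parser rejects such files outright.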

Closing as duplicate of https://github.com/owncloud/core/issues/6641

butonic commented 10 years ago

Well, reopening here and closing in core since it's an app.

DeepDiver1975 commented 10 years ago

@butonic which version of zend_pdf is used? Please consider using composer

    "zendframework/zendpdf": "2.*",

jfreak53 commented 10 years ago

My PDF version is 1.4 and it still does not work: http://i.imgur.com/HVWy2zd.png

butonic commented 10 years ago

@jfreak53 can I download the PDF somewhere? Or does it contain sensitive data?

jfreak53 commented 10 years ago

Not at all, go for it please: https://cloud.microtronix-tech.com/public.php?service=files&t=c634a292a95458dbac16892afefca1f2

It's probably something stupid on my end or with the PDF file itself if it's the right version. It's a converted HTML page using wkhtmltopdf.

butonic commented 10 years ago

TL;DR: try searching for 'term*'

long version:

searching via full text: http://localhost/core-stable5/index.php/search/ajax/search.php?query=javascript gives me

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]

searching via wildcard: http://localhost/core-stable5/index.php/search/ajax/search.php?query=java* gives me

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]

By default, Lucene does not treat the search term like an SQL LIKE '%term%' statement; it matches whole indexed terms. The drawback is that this changes what has to be put in the search field. For performance reasons, I decided to use the Lucene defaults and not allow searching for partial strings or simulating the core search. You can enable searching for partial terms by uncommenting https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L225 and prepending the line with a \ so it looks like:

\Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);

searching via wildcards: http://localhost/core-stable5/index.php/search/ajax/search.php?query=*ava* then gives

[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]
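To make the three cases above concrete, here is a rough Python sketch of how a minimum prefix length gates wildcard queries against a list of indexed terms. It is only a model of the behaviour described here, not the Zend implementation: `wildcard_match` and its parameters are invented for illustration, and the default of 3 is an assumption about the Zend_Search_Lucene default.

```python
import re

def wildcard_match(pattern, terms, min_prefix_length=3):
    """Return the indexed terms matching a pattern with '*' and '?' wildcards.

    Hypothetical model; min_prefix_length=3 is an assumed default.
    """
    # Count the literal characters before the first wildcard.
    first = re.search(r"[*?]", pattern)
    prefix_len = first.start() if first else len(pattern)
    if first and prefix_len < min_prefix_length:
        raise ValueError(
            f"at least {min_prefix_length} non-wildcard characters are required "
            "at the beginning of the pattern"
        )
    # Translate the pattern into a regular expression matched against whole terms.
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    return [t for t in terms if re.fullmatch(regex, t)]
```

With terms such as `["javascript", "excel", "pmt"]`, the patterns `javascript` and `java*` both find `javascript`, a bare `java` finds nothing because it is not a whole term, and `*ava*` is rejected unless `min_prefix_length=0` is passed.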

If you feel performance is not a problem you can automatically add the wildcards by uncommenting lines 217 or 218-221, depending on what kind of search terms you want to use automagically.

Another reason I commented out those lines is that adding wildcards will not allow using the more sophisticated query features like "excel AND javascript" or "AUTHOR:dreyer AND TITLE:something". @jancborchardt UX input might be nice here on how to approach the topic of what kind of search to allow. We might even write our own query parser to implement custom search behavior.
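One possible middle ground, sketched in Python purely as a discussion aid: add the trailing wildcard only when the query looks like plain words, and pass anything that uses query syntax through untouched. `auto_wildcard` and the `_ADVANCED` pattern are hypothetical and are not what the commented lines in lucene.php do. Treating only uppercase AND/OR/NOT as operators is also an assumption.

```python
import re

# Markers of query syntax that a naive wildcard rewrite would break:
# boolean keywords, field qualifiers, phrases, grouping, existing wildcards, etc.
_ADVANCED = re.compile(r'\b(AND|OR|NOT)\b|[:"()*?~^+\-\[\]{}]')

def auto_wildcard(query: str) -> str:
    """Append '*' to each plain term; leave advanced queries unchanged.

    Hypothetical helper for illustration only.
    """
    if _ADVANCED.search(query):
        return query
    return " ".join(term + "*" for term in query.split())
```

So `javascr` would become `javascr*`, while `excel AND javascript` and `AUTHOR:dreyer AND TITLE:something` would reach the query parser as typed. The minimum prefix length still applies to the rewritten terms.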

hope this helps.

jfreak53 commented 10 years ago

Still not working for me: https://cloud.microtronix-tech.com/index.php/search/ajax/search.php?query=exam*

Returns: []

Is something missing on my server? The searched term should return the document as the text example is in the PDF.

Of course java* works as that's the name of the file. But parsing text isn't working in the PDF.

jancborchardt commented 10 years ago

Hm, UX-wise searching for partials of words should just work, without using the asterisk* indicator. If I search for "javascr", stuff about javascript should turn up as well of course. Or am I not getting the issue here?

butonic commented 10 years ago

Ok, currently we start searching when the user has entered at least three characters. (we still highlight the currently visible files containing the search term < 3 chars). How do we tell that to the user? With a placeholder "Type at least 3 characters to start searching"? Not good enough, IMHO. Or do we show a tipsy if less than three characters are in the search field and it has the focus? Better ... @jancborchardt how should it be done?

butonic commented 10 years ago

@jfreak53 my bad. Seems the body of the pdf is garbled. Investigating.

jancborchardt commented 10 years ago

@butonic first off, it should start searching with 2 characters, not 3. That would already alleviate the problem since probably no one expects the search to work with just one character.

butonic commented 10 years ago

@jancborchardt hm, I guess that depends on the performance. I need to find the time for a thorough analysis.

jfreak53 commented 10 years ago

Garbled! Aha! What did the pdf converter do to it I wonder hmm.

I think 3 characters is just fine, less than that and if there are a lot of files it finds a lot of non matching stuff. Or, leave an option in the admin :) dropdown style ha ha. Leave it up to the administrator of each unit, easy peasy :)

jancborchardt commented 10 years ago

@jfreak53 we won’t introduce options for minute things like these. ;)

jfreak53 commented 10 years ago

I wanted to check on this, I've upgraded to the most recent version of OC recently but still it's not indexing the text. Any ideas on this one?

butonic commented 10 years ago

now tracked in https://github.com/owncloud/search_lucene/issues/14

owncloud-archive / apps

[OC6 Search Lucene] does not search PDF file #1591