ser72 closed 10 years ago
@butonic What do you think?
I can reproduce it with ownCloudAdminManual.pdf. The log shows:
{"app":"search_lucene","message":"Cross-reference streams are not supported yet. Trace:\\n
#0 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf\/Parser.php(460): Zend_Pdf_Parser->_loadXRefTable('760683')\n#1 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(318): Zend_Pdf_Parser->__construct('%PDF-1.5?%?????...', Object(Zend_Pdf_ElementFactory_Proxy), false)\n
#2 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/3rdparty\/Zend\/Pdf.php(255): Zend_Pdf->__construct('%PDF-1.5?%?????...', NULL)\n
#3 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(174): Zend_Pdf::parse('%PDF-1.5?%?????...')\n
#4 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/lib\/indexer.php(81): OCA\\Search_Lucene\\Indexer::extractMetadata(Object(Zend_Search_Lucene_Document), '\/ownCloudAdminM...', Object(OC\\Files\\View), 'application\/pdf')\n
#5 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(45): OCA\\Search_Lucene\\Indexer::indexFile('\/ownCloudAdminM...', 'admin')\n
#6 \/home\/jfd\/Repositories\/oc\/core-stable5\/apps2\/search_lucene\/ajax\/lucene.php(77): index()\n
#7 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(700): require_once('\/home\/jfd\/Repos...')\n
#8 [internal function]: OC::loadAppScriptFile(Array)\n
#9 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/router.php(127): call_user_func(Array, Array)\n
#10 \/home\/jfd\/Repositories\/oc\/core-stable5\/lib\/base.php(629): OC_Router->match('\/apps\/search_lu...')\n
#11 \/home\/jfd\/Repositories\/oc\/core-stable5\/index.php(28): OC::handleRequest()\n
#12 {main}","level":3,"time":"2014-01-06T14:02:42+00:00"}
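The trace above shows Zend_Pdf failing in `_loadXRefTable` on a file whose header starts with `%PDF-1.5`. PDF 1.5 allows the cross-reference data to be stored as a stream object instead of the classic `xref` table. A minimal sketch for checking which form a given file uses, so affected PDFs can be identified before indexing, could look like this. The function name `xref_kind` and the 1024-byte look-ahead are assumptions for illustration; this is not part of search_lucene or Zend_Pdf.

```python
import re


def xref_kind(pdf_bytes: bytes) -> str:
    """Return 'table', 'stream', or 'unknown' for the last cross-reference section."""
    # The final "startxref" keyword is followed by the byte offset of the
    # most recently written cross-reference section.
    offsets = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    if not offsets:
        return "unknown"
    offset = int(offsets[-1])

    # Look at a small window starting at that offset (window size is arbitrary).
    section = pdf_bytes[offset:offset + 1024]

    # A classic cross-reference table starts with the keyword "xref".
    if section.lstrip().startswith(b"xref"):
        return "table"

    # A cross-reference stream is an indirect object ("12 0 obj") whose
    # dictionary contains /Type /XRef.
    if re.match(rb"\s*\d+\s+\d+\s+obj", section) and re.search(rb"/Type\s*/XRef", section):
        return "stream"

    return "unknown"
```

Calling `xref_kind(open(path, "rb").read())` on a file that returns `'stream'` would be expected to trigger the "Cross-reference streams are not supported yet" message in this parser version.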
Closing as duplicate of https://github.com/owncloud/core/issues/6641
Well, reopening here and closing in core since it's an app.
@butonic is this version of zend_pdf used? Please consider using Composer:
"zendframework/zendpdf": "2.*",
My PDF version is 1.4 and it still does not work: http://i.imgur.com/HVWy2zd.png
@jfreak53 can I download the PDF somewhere? Or does it contain sensitive data?
Not at all, go for it please: https://cloud.microtronix-tech.com/public.php?service=files&t=c634a292a95458dbac16892afefca1f2
It's probably something stupid on my end or with the PDF file itself if it's the right version. It's a converted HTML page using wkhtmltopdf.
TL;DR: try searching for 'term*'
long version:
searching via full text: http://localhost/core-stable5/index.php/search/ajax/search.php?query=javascript gives me
[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]
searching via wildcard: http://localhost/core-stable5/index.php/search/ajax/search.php?query=java* gives me
[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]
Lucene does not, by default, use the search term in an SQL LIKE '%term%' statement. The drawback is that this changes what has to be put in the search field. For performance reasons, I decided to use the Lucene defaults and not allow searching for partial strings or simulating the core search. You can enable searching for partial terms by uncommenting
https://github.com/owncloud/apps/blob/master/search_lucene/lib/lucene.php#L225 and prepending a \ to the line
so it looks like:
\Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);
searching via wildcards: http://localhost/core-stable5/index.php/search/ajax/search.php?query=*ava*
then gives
[{"name":"javascript - Excel PMT function in JS - Stack.pdf","text":"\/, 26.2 kB, Score: 1.00","link":"\/core-stable5\/index.php\/apps\/files\/download\/javascript%20-%20Excel%20PMT%20function%20in%20JS%20-%20Stack.pdf","type":"Files","container":null}]
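To illustrate why `java*` works out of the box while `*ava*` only works after `setMinPrefixLength(0)`, here is a rough simulation of a minimum-prefix-length check. It is a sketch under assumptions, not the Zend implementation: the function name `wildcard_search`, the default of 3, and the use of `fnmatch` for matching are all illustrative choices.

```python
import fnmatch


def wildcard_search(pattern, terms, min_prefix_length=3):
    """Match indexed terms against a wildcard pattern, enforcing a minimum literal prefix."""
    # Count the literal characters that come before the first wildcard.
    prefix_len = 0
    for ch in pattern:
        if ch in "*?":
            break
        prefix_len += 1

    has_wildcard = any(ch in pattern for ch in "*?")
    if has_wildcard and prefix_len < min_prefix_length:
        # A short prefix means almost every term in the index has to be
        # scanned, which is the performance concern behind the limit.
        raise ValueError(
            f"at least {min_prefix_length} non-wildcard characters are required "
            "at the start of the pattern"
        )

    return [term for term in terms if fnmatch.fnmatchcase(term, pattern)]
```

With `terms = ["javascript", "java", "excel"]`, the pattern `java*` passes the check because it has a four-character prefix, whereas `*ava*` has a prefix of length zero and is only accepted when `min_prefix_length=0` is passed.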
If you feel performance is not a problem, you can automatically add the wildcards by uncommenting line 217 or lines 218-221, depending on what kind of search terms you want to use automagically.
Another reason I commented out those lines is that adding wildcards prevents the use of more sophisticated query features like "excel AND javascript" or "AUTHOR:dreyer AND TITLE:something". @jancborchardt UX input might be nice here on how to approach the topic of what kind of search to allow. We might even write our own query parser to implement custom search behavior.
Hope this helps.
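One possible direction for such custom behavior, sketched here purely as an assumption and not as what the commented-out lines in lucene.php do: wrap plain terms in wildcards automatically, but leave the query untouched as soon as it uses operators, field prefixes, quotes, or explicit wildcards. The function name `add_wildcards` and the operator list are hypothetical.

```python
# Hypothetical set of operators that mark a query as "advanced".
OPERATORS = {"AND", "OR", "NOT"}


def add_wildcards(query: str) -> str:
    """Wrap plain search terms in wildcards; pass advanced queries through unchanged."""
    tokens = query.split()

    def is_advanced(token: str) -> bool:
        # Boolean operators, field prefixes (AUTHOR:...), phrases, and
        # explicit wildcards all indicate the user wrote query syntax on purpose.
        return (
            token in OPERATORS
            or ":" in token
            or '"' in token
            or "*" in token
            or "?" in token
        )

    if any(is_advanced(token) for token in tokens):
        return query

    return " ".join(f"*{token}*" for token in tokens)
```

Under this sketch, `add_wildcards("javascr")` yields `*javascr*`, while `excel AND javascript` and `AUTHOR:dreyer AND TITLE:something` are returned as typed. Leading wildcards like these would still require the minimum prefix length to be set to 0.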
Still not working for me: https://cloud.microtronix-tech.com/index.php/search/ajax/search.php?query=exam*
Returns: []
Is something missing on my server? The search term should return the document, as the text "example" is in the PDF. Of course java* works, as that's the name of the file. But parsing the text in the PDF isn't working.
Hm, UX-wise searching for partials of words should just work, without using the asterisk* indicator. If I search for "javascr", stuff about javascript should turn up as well of course. Or am I not getting the issue here?
OK, currently we start searching once the user has entered at least three characters (with fewer than three, we still highlight the currently visible files containing the search term). How do we tell the user that? With a placeholder "Type at least 3 characters to start searching"? Not good enough, IMHO. Or do we show a tipsy if fewer than three characters are in the search field and it has the focus? Better ... @jancborchardt how should it be done?
@jfreak53 my bad. Seems the body of the pdf is garbled. Investigating.
@butonic first off, it should start searching with 2 characters, not 3. That would already alleviate the problem since probably no one expects the search to work with just one character.
@jancborchardt hm, I guess that depends on the performance. I need to find the time for a thorough analysis.
Garbled! Aha! What did the pdf converter do to it I wonder hmm.
I think 3 characters is just fine. With fewer than that, if there are a lot of files, it finds a lot of non-matching stuff. Or leave an option in the admin :) dropdown style ha ha. Leave it up to the administrator of each unit, easy peasy :)
@jfreak53 we won’t introduce options for minute things like these. ;)
I wanted to check on this. I've upgraded to the most recent version of OC, but it's still not indexing the text. Any ideas on this one?
Now tracked in https://github.com/owncloud/search_lucene/issues/14
Expected Results: Find text in PDF file
Actual Results: Unable to find pinpoint file
Steps:
1. Load OC6 (in this instance OC6 Enterprise Daily Build 1/6/14).
2. Create a new file as a link to a PDF: http://doc.owncloud.com/server/5.0EE/ownCloudAdminManual.pdf
3. The file is then uploaded to ownCloud.
4. Search for "developers", for instance (it appears on page 1 of the doc).
5. Nothing comes up.
6. Add a text file and search for something in it; the text file is found.
The App states: "We currenty support plain text, HTML and PDF files. MS Office 2007 and Open/Libre Office are on the roadmap. "
So PDF should work
{"installed":"true","version":"6.90.0.1","versionstring":"7.0 pre alpha","edition":"enterprise"}
Ubuntu PHP 5.4.23