Can you check your Solr log? It might have more information on the file it is currently indexing.
The file solr_gc.log.1.current contains numerous entries like these:
```
...
2017-02-23T21:36:45.094+0100: 1303861,253: Total time for which application threads were stopped: 0,0003022 seconds, Stopping threads took: 0,0000627 seconds
2017-02-23T21:36:45.099+0100: 1303861,257: Total time for which application threads were stopped: 0,0001651 seconds, Stopping threads took: 0,0000402 seconds
2017-02-23T21:36:45.099+0100: 1303861,258: Total time for which application threads were stopped: 0,0001103 seconds, Stopping threads took: 0,0000246 seconds
2017-02-23T21:36:45.100+0100: 1303861,258: Total time for which application threads were stopped: 0,0001118 seconds, Stopping threads took: 0,0000287 seconds
2017-02-23T21:36:45.100+0100: 1303861,258: Total time for which application threads were stopped: 0,0001141 seconds, Stopping threads took: 0,0000283 seconds
...
```
In the file solr.log I can see:
```
2017-02-23 20:43:49.528 INFO (commitScheduler-10-thread-1) [ x:nextant] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-02-23 20:44:17.295 INFO (qtp1355531311-13) [ x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant] webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX1.pdf&literal.id=files127948&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX1/XXX1/XXX1.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1474441919&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX1@XXX1.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_127948 (1560158045244227584)]} 0 74
2017-02-23 20:44:17.303 INFO (qtp1355531311-21) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=:&json.nl=flat&omitHeader=true&fl=,score&start=0&rows=10&wt=json} status=0 QTime=4
2017-02-23 20:45:01.464 INFO (qtp1355531311-15) [ x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant] webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX2.pdf&literal.id=files26672&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX2/XXX2/XXX2.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1431428171&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX2@XXX2.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_26672 (1560158091565072384)]} 0 52
2017-02-23 20:45:01.471 INFO (qtp1355531311-21) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=:&json.nl=flat&omitHeader=true&fl=,score&start=0&rows=10&wt=json} status=0 QTime=5
2017-02-23 20:45:17.295 INFO (commitScheduler-10-thread-1) [ x:nextant] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2017-02-23 20:45:17.295 INFO (commitScheduler-10-thread-1) [ x:nextant] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@25f4cbb1
2017-02-23 20:45:17.384 INFO (commitScheduler-10-thread-1) [ x:nextant] o.a.s.s.SolrIndexSearcher Opening [Searcher@221d07b2[nextant] realtime]
2017-02-23 20:45:17.385 INFO (commitScheduler-10-thread-1) [ x:nextant] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-02-23 20:45:44.923 WARN (qtp1355531311-20) [ x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-BoldMT'
2017-02-23 20:45:44.924 WARN (qtp1355531311-20) [ x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'ArialMT'
2017-02-23 20:45:44.927 WARN (qtp1355531311-20) [ x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-BoldItalicMT'
2017-02-23 20:45:44.933 WARN (qtp1355531311-20) [ x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-ItalicMT'
2017-02-23 20:45:44.986 INFO (qtp1355531311-20) [ x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant] webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX3.pdf&literal.id=files100151&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX3/XXX3/XXX3.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1462777393&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX3@XXX3.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_100151 (1560158137183371264)]} 0 113
2017-02-23 20:45:44.996 INFO (qtp1355531311-12) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=:&json.nl=flat&omitHeader=true&fl=*,score&start=0&rows=10&wt=json} status=0 QTime=4
```
Are these local files, or do they come from a specific file system that you share globally? If the files are local and on the same filesystem as ownCloud, what are the sizes of the PDFs?
Yes, all files are local under /var/www/html/owncloud/data. The sizes range from a few KB to approximately 5-10 MB.
I would kill the current indexing process, restart Solr, and start ./occ nextant:index again with the --debug option to see how it goes.
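Something like this, assuming Solr was installed as a service per the wiki (the ownCloud path and web server user below are assumptions for your setup):

```bash
# stop the running indexing process first (Ctrl+C in its terminal, or kill the php process)

# restart Solr (assuming the standard service install from the Nextant wiki)
sudo service solr restart

# start the index again with debug output, run as the web server user (path assumed)
cd /var/www/html/owncloud
sudo -u www-data php ./occ nextant:index --debug
```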
Do you have any specific configuration regarding your cloud or Solr (like Tesseract)? If it is still slow, add .pdf to the extension filters and start indexing again to see how it reacts.
The indexing that has been running for a week already uses the --debug parameter.
There is a message every 200-300 documents:
"" Error # 21 (HTTP request failed, operation timed out after 30000 milliseconds with 0 bytes received) "" Failed to extract document # 140325
This is the only error message that is displayed.
As I said, our ownCloud installation contains 50,000 documents, about 900 users, and about 30 groups.
I have left the Solr configuration at its defaults.
I installed Solr as a service as follows: https://github.com/nextcloud/nextant/wiki/Setup-your-local-Solr-as-a-Service https://github.com/nextcloud/nextant/wiki/And-some-more-...
in "TesseractOCRConfig.properties" i have changed language=eng to language=deu
SOLR_HEAP = "6144m"
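(For reference, roughly where those two settings live with the standard service install; the exact file locations are an assumption:)

```
# /etc/default/solr.in.sh (standard service install; path assumed)
SOLR_HEAP="6144m"

# TesseractOCRConfig.properties (inside the Tika jar used by Solr's extraction handler)
language=deu
```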
You mean ignore the .pdf extension?
50k documents is not an issue, and neither is the number of users. But can you uninstall Tesseract, restart Solr, and start your index again?
nothing changed :(
It hangs for a long time in [scanning] (the PHP process runs at 100%). [extracting] shows for only 1 or 2 seconds and then it switches back to [scanning].
Do you have any other advice for me?
Can you paste the last 4 lines of a
./occ nextant:check
(and confirm all your lines are green)
Your solr contains 23786 documents :
all lines are green
try
./occ nextant:optimize
Also set the timeout to 60 seconds from the Admin interface.
I am sorry if I can't take more time to help you right now
any update on that slow scan ?
Excuse me, I was not in the office for the last two weeks.
When I set the timeout to 60 seconds and click save, it tests the settings and all tests are green.
The timeout then jumps back to 30 seconds, so I can't actually set it to 60 seconds.
Running ./occ nextant:optimize changed nothing :(
I have found another problem.
When I search within a subfolder, the search results for already indexed files have a wrong path in the href attribute.
Example:
The result has href="https://cloud.de/owncloud/remote.php/webdav/public/not_public/test.ppt"
The correct path is https://cloud.de/owncloud/remote.php/webdav/not_public/test.ppt
When I search from the root folder, the output is okay.
Hello daita, I have analyzed the problem further. I deleted the index again and started to recreate it.
At the beginning, approximately 30 documents per minute were indexed. After 13,700 documents, it indexes only 3 documents per minute.
The Solr database has meanwhile reached a size of 2.5 GB (13,700 documents).
Nextant makes a commit after each document. The next document is scanned only after the commit has been completed.
Can the commit be configured to run after 100 documents?
I think this could speed up the scan.
The larger the database, the longer the commit takes.
The commit should be done every n seconds (30) or every minute. It shouldn't be that slow after 14k documents. Check the number of segments of your index, and see if your commits are timing out.
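For reference, batching commits on the Solr side is normally handled through autoCommit/autoSoftCommit in the core's solrconfig.xml. A minimal sketch (the values below are illustrative, not what Nextant ships):

```xml
<!-- solrconfig.xml of the nextant core (illustrative values) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush to disk at most every 60 s or every 100 docs, without reopening searchers -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <maxDocs>100</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make newly added documents visible to searches every 30 s -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
</updateHandler>
```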
Your solr contains 13175 documents :
```
/admin/files 13764/42472 [=========>------------------] 32% (00:28:16) [scanning]
Solr memory: 605.9 MB (%10.3)
3 documents extracted in the last minute. Last commit took 58ms
```
No timeout.
PHP runs permanently at 100%:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2860 www-data  20   0 1846332 0,993g  59564 R  99,6  6,3   2816:49 php
```
try ./occ nextant:optimize
in another terminal.
If it fails, stop the current index, optimize, and start your index again. If you do not use the --force argument, the index should not re-scan your previously scanned documents.
after optimizing:
Your solr contains 13222 documents :
The speed has not changed :(
This is really strange. You should test injecting documents directly into your Solr and see if it takes that much time. This way we should know whether the issue is with Nextant/ownCloud or with Solr.
How can I inject the documents directly?
Have you set up access to the web interface of your Solr?
yes
Open the web interface, select the core, and check the 'Documents' sub-menu. If I remember correctly you can upload a file from there.
I have the following fields there:
- Request-Handler (qt)
- Document Type (dropdown list with possible values)
- Document(s) (text field)
- Commit Within
- Overwrite
- Boost
Can I define an absolute path like /var/WWW/owncloud/data/admin/files?
Change the Document Type option to File Upload.
Okay, but can I only upload one single file from my local PC?
You should get a huge PDF file from the internet, like the Apache Solr documentation which is 10+ MB, and see if it takes a while to be extracted.
Okay, I got the following error :(
I think you have to change the Request-Handler to /update/extract/ or /update/extract.
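Alternatively, from the command line, something like this should exercise the same extraction path (a sketch: the core name comes from your logs, the document id and file path are placeholders):

```bash
# post a local PDF to the nextant core's extraction handler and commit
curl "http://127.0.0.1:8983/solr/nextant/update/extract?literal.id=test-doc-1&commit=true" \
     -F "myfile=@/tmp/solr-ref-guide.pdf"
```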
And now I got the following error :(
Can you paste the result of ./occ nextant:check --info ?
```php
array (
  'instant' => true,
  'configured' => '1',
  'ping' => true,
  'solr_url' => 'HIDDEN',
  'solr_core' => 'HIDDEN',
  'solr_timeout' => '120',
  'nextant_version' => '1.0.6',
  'index_files' => '1',
  'index_files_needed' => '0',
  'index_files_max_size' => '40',
  'index_files_tree' => '0',
  'index_files_nextant_only' => '0',
  'index_files_trash' => '0',
  'index_files_sharelink' => '0',
  'index_files_federated' => '0',
  'index_files_external' => '0',
  'index_files_encrypted' => '0',
  'index_files_filters_text' => '1',
  'index_files_filters_pdf' => '1',
  'index_files_filters_office' => '1',
  'index_files_filters_image' => '1',
  'index_files_filters_audio' => '0',
  'index_files_filters_extensions' => '.csv',
  'current_docs' => 13257,
  'current_segments' => 5,
  'files_external_app_enabled' => false,
  'bookmarks_app_enabled' => false,
  'index_bookmarks' => 0,
  'index_bookmarks_needed' => '0',
  'resource_level' => '4',
  'index_live' => '2',
  'use_cron' => '0',
  'index_delay_min' => '2',
  'index_delay_max' => '4',
  'index_locked' => '0',
  'index_files_last' => '0',
  'index_files_last_format' => 'Thu, 01 Jan 1970 00:00:00 +0000',
  'index_bookmarks_last' => '0',
  'index_bookmarks_last_format' => 'Thu, 01 Jan 1970 00:00:00 +0000',
  'ext_sysvmsg_loaded' => true,
  'source' => 'check',
  'cloud_version' => '9.1.4.2',
)
Pinging 127.0.0.1:8983/solr/nextant : ok
Checking Solr schema fields
Cleaning extra fields
Your solr contains 13257 documents :
```
The "Cleaning extra fields" line appears after the manual import from above.
I ran ./occ nextant:check --fix
Then I did a manual import again (same failure), and the "Cleaning extra fields" message appears again.
Remove the filters on image files from the Admin interface; we never know with Tesseract.
Also, run another ./occ nextant:check
to see if every field is fixed after a --fix.
Please try to manually import a simple text file
A simple text file causes no problems.
I have investigated the problem again and found the following entries in solr_gc.log:
```
2017-04-10T20:38:23.882+0200: 1,954: Total time for which application threads were stopped: 0,0001462 seconds, Stopping threads took: 0,0000557 seconds
2017-04-10T20:38:23.915+0200: 1,987: Total time for which application threads were stopped: 0,0001391 seconds, Stopping threads took: 0,0000426 seconds
{Heap before GC invocations=0 (full 0):
 par new generation   total 170688K, used 136576K [0x00000006c0000000, 0x00000006cc800000, 0x00000006cc800000)
  eden space 136576K, 100% used [0x00000006c0000000, 0x00000006c8560000, 0x00000006c8560000)
  from space 34112K,   0% used [0x00000006c8560000, 0x00000006c8560000, 0x00000006ca6b0000)
  to   space 34112K,   0% used [0x00000006ca6b0000, 0x00000006ca6b0000, 0x00000006cc800000)
 concurrent mark-sweep generation total 3989504K, used 0K [0x00000006cc800000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 22526K, capacity 23020K, committed 23212K, reserved 1069056K
  class space    used 2669K, capacity 2882K, committed 2892K, reserved 1048576K
2017-04-10T20:38:23.930+0200: 2,002: [GC (Allocation Failure) 2017-04-10T20:38:23.930+0200: 2,002: [ParNew
Desired survivor size 17465344 bytes, new threshold 15 (max 15)
```
Can you disable Tesseract and start an index again ?
Do you know how I can disable Tesseract, or do I have to uninstall it?
I have tested the following parameters for Solr:
```
-Xms4096M -Xmx4096M -XX:PermSize=512m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops -XX:NewSize=200m -XX:MaxNewSize=200m
-XX:+OptimizeStringConcat -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:NewRatio=3 -XX:SurvivorRatio=4
```
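(A sketch of how such flags can be applied with the standard service install; the file path and the exact split between variables are assumptions:)

```bash
# /etc/default/solr.in.sh (path assumed from the standard service install)
SOLR_HEAP="4096m"   # sets both -Xms and -Xmx
GC_TUNE="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled \
         -XX:NewSize=200m -XX:MaxNewSize=200m -XX:SurvivorRatio=4"
```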
I think it runs a little better (faster), but it's not optimal yet.
I think it is easier to uninstall Tesseract than to edit a .jar file.
Hello daita, I have removed Tesseract but nothing changed; the problem is the same.
At first everything goes very well. It takes a few minutes until it has determined the files. Then it starts with "preparing", which runs very fast. Then it starts with "scanning"; at the beginning it processes about 40 documents per minute. After a few minutes, it processes only 30 documents, then only 20 documents, and so on.
Hello daita, I have solved the problem! I migrated from ownCloud 9.1.4 to Nextcloud 10 and then to Nextcloud 11.
Then I started the indexing again.
Now the indexing is very fast.
Please use Full text search instead of Nextant.
We have a dedicated server with an Intel Xeon 2.7 GHz processor and 16 GB of RAM. Our ownCloud installation contains 50,000 documents, about 900 users, and about 30 groups.
Most of the documents are shared by the user admin with all users or with certain groups only.
I started a first indexing run with Nextant last week.
At the beginning the indexing speed was quite fast. After about an hour, it started getting slower.
The indexing has now been running as user admin since last week.
```
/admin/files 18811/40404 [=============>--------------] 46% (13:16:58) [scanning]
Solr memory: 3.1 GB (%53.4)
2 documents extracted in the last minute. Last commit took 23337ms
```
I noticed that the PHP process permanently has a CPU utilization of 100 percent.
```
top - 20:13:21 up 16 days,  3:42,  3 users,  load average: 1,16, 1,15, 1,15
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24,8 us,  1,0 sy,  0,0 ni, 73,7 id,  0,2 wa,  0,0 hi,  0,2 si,  0,0 st
KiB Mem:  16471568 total, 11925064 used,  4546504 free,   302176 buffers
KiB Swap: 33551356 total,    27472 used, 33523884 free.  3604208 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27791 www-data  20   0 2565068 1,640g  54556 R  99,5 10,4  10474:52 php
```
Does anyone have advice on where the problem might be?
We use ownCloud 9.1.3, Nextant Search 1.0.6, Solr 6.4.0, heap size 6144M.
Best regards
Christian.