nextcloud / fulltextsearch

🔍 Core of the full-text search framework for Nextcloud
https://apps.nextcloud.com/apps/fulltextsearch
GNU Affero General Public License v3.0

First indexing has been running for a week and is still not finished #143

Closed: critzinger closed this 6 years ago

critzinger commented 7 years ago

We have a dedicated server with an Intel Xeon 2.7 GHz processor and 16 GB of RAM. Our ownCloud installation contains 50,000 documents, about 900 users, and about 30 groups.

Most of the documents are shared by the user admin with all users or only with certain groups.

I started a first indexing run with Nextant last week.

At the beginning the indexing speed was quite fast. After about an hour, it started getting slower.

The indexing has now been running as the user admin since last week.

```
/admin/files 18811/40404 [=============>--------------] 46% (13:16:58)
[scanning] Solr memory: 3.1 GB (%53.4)
2 documents extracted in the last minute. Last commit took 23337ms
```

I noticed that the PHP process is permanently at 100 percent CPU utilization.

```
top - 20:13:21 up 16 days, 3:42, 3 users, load average: 1,16, 1,15, 1,15
Tasks: 133 total, 2 running, 131 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24,8 us, 1,0 sy, 0,0 ni, 73,7 id, 0,2 wa, 0,0 hi, 0,2 si, 0,0 st
KiB Mem:  16471568 total, 11925064 used, 4546504 free, 302176 buffers
KiB Swap: 33551356 total, 27472 used, 33523884 free. 3604208 cached Mem

  PID USER     PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
27791 www-data 20  0  2565068 1,640g 54556 R  99,5 10,4 10474:52 php
```

Does anyone have advice on where the problem might be?

We use:

- ownCloud 9.1.3
- Nextant Search 1.0.6
- Solr 6.4.0, heap size 6144M

Best regards

Christian.

ArtificialOwl commented 7 years ago

Can you check your Solr logs? They might have more information about the file it is currently indexing.

critzinger commented 7 years ago

The file solr_gc.log.1.current contains numerous entries like these:

```
...
2017-02-23T21:36:45.094+0100: 1303861,253: Total time for which application threads were stopped: 0,0003022 seconds, Stopping threads took: 0,0000627 seconds
2017-02-23T21:36:45.099+0100: 1303861,257: Total time for which application threads were stopped: 0,0001651 seconds, Stopping threads took: 0,0000402 seconds
2017-02-23T21:36:45.099+0100: 1303861,258: Total time for which application threads were stopped: 0,0001103 seconds, Stopping threads took: 0,0000246 seconds
2017-02-23T21:36:45.100+0100: 1303861,258: Total time for which application threads were stopped: 0,0001118 seconds, Stopping threads took: 0,0000287 seconds
2017-02-23T21:36:45.100+0100: 1303861,258: Total time for which application threads were stopped: 0,0001141 seconds, Stopping threads took: 0,0000283 seconds
...
```

In solr.log I can see:

```
2017-02-23 20:43:49.528 INFO  (commitScheduler-10-thread-1) [   x:nextant] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-02-23 20:44:17.295 INFO  (qtp1355531311-13) [   x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant]  webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX1.pdf&literal.id=files_127948&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX1/XXX1/XXX1.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1474441919&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX1@XXX1.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_127948 (1560158045244227584)]} 0 74
2017-02-23 20:44:17.303 INFO  (qtp1355531311-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=*:*&json.nl=flat&omitHeader=true&fl=*,score&start=0&rows=10&wt=json} status=0 QTime=4
2017-02-23 20:45:01.464 INFO  (qtp1355531311-15) [   x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant]  webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX2.pdf&literal.id=files_26672&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX2/XXX2/XXX2.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1431428171&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX2@XXX2.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_26672 (1560158091565072384)]} 0 52
2017-02-23 20:45:01.471 INFO  (qtp1355531311-21) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=*:*&json.nl=flat&omitHeader=true&fl=*,score&start=0&rows=10&wt=json} status=0 QTime=5
2017-02-23 20:45:17.295 INFO  (commitScheduler-10-thread-1) [   x:nextant] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2017-02-23 20:45:17.295 INFO  (commitScheduler-10-thread-1) [   x:nextant] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@25f4cbb1
2017-02-23 20:45:17.384 INFO  (commitScheduler-10-thread-1) [   x:nextant] o.a.s.s.SolrIndexSearcher Opening [Searcher@221d07b2[nextant] realtime]
2017-02-23 20:45:17.385 INFO  (commitScheduler-10-thread-1) [   x:nextant] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-02-23 20:45:44.923 WARN  (qtp1355531311-20) [   x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-BoldMT'
2017-02-23 20:45:44.924 WARN  (qtp1355531311-20) [   x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'ArialMT'
2017-02-23 20:45:44.927 WARN  (qtp1355531311-20) [   x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-BoldItalicMT'
2017-02-23 20:45:44.933 WARN  (qtp1355531311-20) [   x:nextant] o.a.p.p.f.PDTrueTypeFont Using fallback font 'LiberationSans' for 'Arial-ItalicMT'
2017-02-23 20:45:44.986 INFO  (qtp1355531311-20) [   x:nextant] o.a.s.u.p.LogUpdateProcessorFactory [nextant]  webapp=/solr path=/update/extract params={literal.nextantextracted=true&fmap.p=ignored&fmap.h2=ignored&fmap.style=ignored&fmap.h3=ignored_&resource.name=XXX3.pdf&literal.id=files_100151&fmap.h1=ignored&fmap.link=ignored&fmap.ul=ignored&uprefix=nextantattr&fmap.form=ignored&fmap.i=ignored&fmap.media_whitepoint=ignored&fmap.img=ignored_&literal.nextantpath=/admin/files/XXX3/XXX3/XXX3.pdf&fmap.a=ignored&fmap.b=ignored&fmap.table=ignored&fmap.content=textedge&fmap.script=ignored&literal.nextantowner=admin&fmap.html=ignored&fmap.div=ignored_&ignoreTikaException=true&wt=json&literal.nextant_mtime=1462777393&json.nl=flat&fmap.media_blackpoint=ignored&extractOnly=false&literal.nextantsource=files&fmap.td=ignored&fmap.tr=ignored&fmap.li=ignored&literal.nextant_share=XXX3@XXX3.de&omitHeader=true&literal.nextantdeleted=false&captureAttr=true&fmap.input=ignored&fmap.span=ignored_}{add=[files_100151 (1560158137183371264)]} 0 113
2017-02-23 20:45:44.996 INFO  (qtp1355531311-12) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={q=*:*&json.nl=flat&omitHeader=true&fl=*,score&start=0&rows=10&wt=json} status=0 QTime=4
```

ArtificialOwl commented 7 years ago

Are these local files, or are they on a specific filesystem that you share globally? If the files are local on the same filesystem as ownCloud, what are the sizes of the PDFs?

critzinger commented 7 years ago

Yes, all files are local under /var/www/html/owncloud/data. The files range from a few KB to approx. 5-10 MB.

ArtificialOwl commented 7 years ago

I would kill the current indexing process, restart Solr, and start ./occ nextant:index again with the --debug option and see how it goes.

Do you have any specific configuration for your cloud or Solr (like Tesseract)? If it is still slow, add .pdf to the extension filters and start indexing again to see how it reacts.
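Roughly this, as a sketch; the service name assumes the Solr-as-a-service setup from the Nextant wiki, and the occ call assumes the web server user is www-data:

```
# stop the stuck indexing run (the long-running php occ process)
kill <pid-of-the-php-occ-process>

# restart the Solr service
sudo service solr restart

# start indexing again with verbose output
sudo -u www-data php ./occ nextant:index --debug
```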

critzinger commented 7 years ago

The indexing run that has been going for a week already uses the --debug parameter.

There is a message every 200-300 documents:

"" Error # 21 (HTTP request failed, operation timed out after 30000 milliseconds with 0 bytes received) "" Failed to extract document # 140325

This is the only error message that is displayed.

As I said, our ownCloud installation contains 50,000 documents, about 900 users, and about 30 groups.

I have left the Solr configuration at its defaults.

I installed Solr as a service as follows:
https://github.com/nextcloud/nextant/wiki/Setup-your-local-Solr-as-a-Service
https://github.com/nextcloud/nextant/wiki/And-some-more-...

in "TesseractOCRConfig.properties" i have changed language=eng to language=deu

 SOLR_HEAP="6144m"
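(For reference, that setting lives in the Solr include script; with a service install like the one from the wiki it is presumably /etc/default/solr.in.sh:)

```
# /etc/default/solr.in.sh (path assumed from the standard service install)
SOLR_HEAP="6144m"
```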

You mean ignore the .pdf extension?

ArtificialOwl commented 7 years ago

50k documents is not an issue, and neither is the number of users. But can you uninstall Tesseract, restart Solr, and start your index again?

critzinger commented 7 years ago

nothing changed :(

It hangs for a long time in [scanning] (the PHP process runs at 100%). I see [extracting] for only 1 or 2 seconds, then it changes back to [scanning].

Do you have any other advice for me?

ArtificialOwl commented 7 years ago

Can you paste the last 4 lines of a

 ./occ nextant:check

(and confirm all your lines are green)

critzinger commented 7 years ago

Your solr contains 23786 documents :

all lines are green

ArtificialOwl commented 7 years ago

try

 ./occ nextant:optimize

Also set the timeout to 60 seconds from the Admin interface.

I am sorry I can't take more time to help you right now.

ArtificialOwl commented 7 years ago

Any update on that slow scan?

critzinger commented 7 years ago

Excuse me, I was not in the office for the last two weeks.

When I set the timeout to 60 seconds and click save, it tests the settings. All tests are green.

But the timeout then jumps back to 30 seconds, so I can't set it to 60 seconds.

./occ nextant:optimize changed nothing :(

I have found another problem.

When I search in a subfolder, the search results for already indexed files have the wrong path in the href.

example:

The result then has href="https://cloud.de/owncloud/remote.php/webdav/public/not_public/test.ppt"

The correct path is https://cloud.de/owncloud/remote.php/webdav/not_public/test.ppt

When I search in the root folder, the output is okay.

critzinger commented 7 years ago

Hello daita, I have analyzed the problem further. I deleted the index again and started to recreate it.

At the beginning, approximately 30 documents/minute were indexed. After 13,700 documents, it indexes only 3 documents/minute.

The Solr database has meanwhile reached a size of 2.5 GB (13,700 documents).

Nextant makes a commit after each document. The next document is scanned only after the commit has been completed.

Can the commit be configured to run after 100 documents?

I think this could speed up the scan.

The larger the database, the longer the commit takes.
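If I read the Solr reference guide correctly, something like this against the Config API could batch the hard commits; just a sketch, the values are illustrative, and it would not help if Nextant issues explicit commits itself:

```
# hard-commit at most every 100 docs or 60 s instead of after each document
curl http://127.0.0.1:8983/solr/nextant/config -H 'Content-type:application/json' -d '{
  "set-property": {
    "updateHandler.autoCommit.maxDocs": 100,
    "updateHandler.autoCommit.maxTime": 60000
  }
}'
```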

ArtificialOwl commented 7 years ago

The commit should happen every n seconds (30 s or 1 minute). It shouldn't be that slow after 14k documents. Check the number of segments in your index, and see whether your commits are timing out.
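The segment count shows up in the check output, so something like this should do (assuming the occ setup from earlier in this thread):

```
sudo -u www-data php ./occ nextant:check --info | grep current_segments
```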

critzinger commented 7 years ago

Your solr contains 13175 documents :

```
/admin/files 13764/42472 [=========>------------------] 32% (00:28:16)
[scanning] Solr memory: 605.9 MB (%10.3)
3 documents extracted in the last minute. Last commit took 58ms
```

no timeout

PHP runs permanently at 100%.

```
 PID USER     PR NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
2860 www-data 20  0  1846332 0,993g 59564 R  99,6  6,3 2816:49 php
```

ArtificialOwl commented 7 years ago

Try ./occ nextant:optimize in another terminal.

If it fails, stop the current indexing, optimize, and start your index again. If you do not use the --force argument, it should not re-scan your previously scanned documents.

critzinger commented 7 years ago

after optimizing:

Your solr contains 13222 documents :

the speed has not changed :(

ArtificialOwl commented 7 years ago

This is really strange. You should test injecting documents directly into your Solr and see if it takes that much time. That way we would know whether the issue is with Nextant/ownCloud or with Solr.

critzinger commented 7 years ago

How can I inject the documents directly?

ArtificialOwl commented 7 years ago

Have you set up access to the web interface of your Solr?

critzinger commented 7 years ago

yes

ArtificialOwl commented 7 years ago

Open the web interface, select the core, and check the 'Documents' submenu. If I remember well, you can upload a file from there.

critzinger commented 7 years ago

I have:

- Request-Handler (qt)
- Document Type (dropdown list)
- Document(s) (text field)
- Commit Within
- Overwrite
- Boost

Can I define an absolute path like /var/www/owncloud/data/admin/files?

[screenshot]

ArtificialOwl commented 7 years ago

Change the Document Type option to File Upload.

critzinger commented 7 years ago

Okay, but can I only upload a single file from my local PC?

ArtificialOwl commented 7 years ago

You should get a huge PDF file from the internet, like the Apache Solr documentation, which is 10+ MB, and see if it takes a while to be extracted.

critzinger commented 7 years ago

Okay, I got the following error :(

[screenshot]

ArtificialOwl commented 7 years ago

I think you have to set the Request-Handler to /update/extract/ or /update/extract.
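From a terminal, the equivalent would be something along these lines (a sketch; core name nextant and the default port assumed, the id and file name are arbitrary):

```
# push one PDF through the extract handler and commit right away
curl "http://127.0.0.1:8983/solr/nextant/update/extract?literal.id=test_doc_1&commit=true" \
  -F "file=@big-test.pdf"
```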

critzinger commented 7 years ago

And now I get the following error :(

[screenshot]

ArtificialOwl commented 7 years ago

Can you paste the result of ./occ nextant:check --info?

critzinger commented 7 years ago

```
array (
  'instant' => true,
  'configured' => '1',
  'ping' => true,
  'solr_url' => 'HIDDEN',
  'solr_core' => 'HIDDEN',
  'solr_timeout' => '120',
  'nextant_version' => '1.0.6',
  'index_files' => '1',
  'index_files_needed' => '0',
  'index_files_max_size' => '40',
  'index_files_tree' => '0',
  'index_files_nextant_only' => '0',
  'index_files_trash' => '0',
  'index_files_sharelink' => '0',
  'index_files_federated' => '0',
  'index_files_external' => '0',
  'index_files_encrypted' => '0',
  'index_files_filters_text' => '1',
  'index_files_filters_pdf' => '1',
  'index_files_filters_office' => '1',
  'index_files_filters_image' => '1',
  'index_files_filters_audio' => '0',
  'index_files_filters_extensions' => '.csv',
  'current_docs' => 13257,
  'current_segments' => 5,
  'files_external_app_enabled' => false,
  'bookmarks_app_enabled' => false,
  'index_bookmarks' => 0,
  'index_bookmarks_needed' => '0',
  'resource_level' => '4',
  'index_live' => '2',
  'use_cron' => '0',
  'index_delay_min' => '2',
  'index_delay_max' => '4',
  'index_locked' => '0',
  'index_files_last' => '0',
  'index_files_last_format' => 'Thu, 01 Jan 1970 00:00:00 +0000',
  'index_bookmarks_last' => '0',
  'index_bookmarks_last_format' => 'Thu, 01 Jan 1970 00:00:00 +0000',
  'ext_sysvmsg_loaded' => true,
  'source' => 'check',
  'cloud_version' => '9.1.4.2',
)
Pinging 127.0.0.1:8983/solr/nextant : ok
```

Checking Solr schema fields

Cleaning extra fields

Your solr contains 13257 documents :

The 'Cleaning extra fields' line appears after the manual import from above.

I ran ./occ nextant:check --fix.

Then I did a manual import again (same failure), and the 'Cleaning extra fields' message appears again.

ArtificialOwl commented 7 years ago

Remove the filters on image files in the Admin interface; you never know with Tesseract. Also, run another ./occ nextant:check to see whether every field is fixed after a --fix.

Please try to manually import a simple text file

critzinger commented 7 years ago

A simple text file causes no problems.

[screenshot]

critzinger commented 7 years ago

I have investigated the problem again, and I found the following entries in solr_gc.log:

```
2017-04-10T20:38:23.882+0200: 1,954: Total time for which application threads were stopped: 0,0001462 seconds, Stopping threads took: 0,0000557 seconds
2017-04-10T20:38:23.915+0200: 1,987: Total time for which application threads were stopped: 0,0001391 seconds, Stopping threads took: 0,0000426 seconds
{Heap before GC invocations=0 (full 0):
 par new generation   total 170688K, used 136576K [0x00000006c0000000, 0x00000006cc800000, 0x00000006cc800000)
  eden space 136576K, 100% used [0x00000006c0000000, 0x00000006c8560000, 0x00000006c8560000)
  from space 34112K,   0% used [0x00000006c8560000, 0x00000006c8560000, 0x00000006ca6b0000)
  to   space 34112K,   0% used [0x00000006ca6b0000, 0x00000006ca6b0000, 0x00000006cc800000)
 concurrent mark-sweep generation total 3989504K, used 0K [0x00000006cc800000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 22526K, capacity 23020K, committed 23212K, reserved 1069056K
  class space    used 2669K, capacity 2882K, committed 2892K, reserved 1048576K
2017-04-10T20:38:23.930+0200: 2,002: [GC (Allocation Failure) 2017-04-10T20:38:23.930+0200: 2,002: [ParNew
Desired survivor size 17465344 bytes, new threshold 15 (max 15)
```

ArtificialOwl commented 7 years ago

Can you disable Tesseract and start an index again?

critzinger commented 7 years ago

Do you know how I can disable Tesseract, or do I have to uninstall it?

I have tested the following parameters for Solr:

```
-Xms4096M -Xmx4096M
-XX:PermSize=512m -XX:MaxPermSize=512m
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+UseCompressedOops
-XX:NewSize=200m -XX:MaxNewSize=200m
-XX:+OptimizeStringConcat
-XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled
-XX:NewRatio=3 -XX:SurvivorRatio=4
```
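(As far as I can tell, -XX:PermSize and -XX:MaxPermSize are ignored on Java 8, which Solr 6 requires, so those two flags should have no effect.)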

I think it runs a little better (faster), but it is not optimal yet.

ArtificialOwl commented 7 years ago

I think it is easier to uninstall Tesseract than to edit a .jar file.

critzinger commented 7 years ago

Hello daita, I have removed Tesseract, but nothing changed; the problem is the same.

At first everything goes very well. It takes a few minutes until it has determined the files. Then it starts 'preparing', which runs very fast. Then it starts 'scanning': at the beginning it does about 40 documents per minute, after a few minutes only 30 documents, then only 20 documents, etc.

critzinger commented 7 years ago

Hello daita, I have solved the problem! I migrated from ownCloud 9.1.4 to Nextcloud 10 and then to Nextcloud 11.

Then I started the indexing again.

Now the indexing is very fast

ArtificialOwl commented 6 years ago

Please use Full text search instead of Nextant.