opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Performance bottleneck in solr? #130

Closed dbsanfte closed 3 years ago

dbsanfte commented 3 years ago

I've been running some indexing on my laptop with 6 cores available. There are around 2.5 million text files to index. So far I am achieving about 5 files every 10 minutes. This will take ages.
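To put "ages" in perspective, here is a quick back-of-the-envelope estimate using the figures above (2.5 million files at ~5 files per 10 minutes):

```python
# Back-of-the-envelope ETA at the observed indexing rate.
total_files = 2_500_000
files_per_10_min = 5

rate_per_min = files_per_10_min / 10        # 0.5 files/minute
eta_minutes = total_files / rate_per_min    # 5,000,000 minutes
eta_years = eta_minutes / (60 * 24 * 365)

print(f"ETA: {eta_years:.1f} years")        # roughly 9.5 years
```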

The performance bottleneck seems to be Solr. `top` shows the following; it is only using one, or perhaps two, cores at best:

(screenshot: `top` output showing Solr using only one or two cores)

It doesn't seem to be particularly I/O-bound. In iotop I see 3 Mbit writes to disk every few seconds, but nothing crazy. No idea why it's so slow.

I've tried raising the RAM buffer in solrconfig.xml, raising the auto-commit threshold to 200 docs / 5 minutes, and enabling soft commits. No perceptible difference in performance.
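For reference, the commit-related settings I touched look roughly like this in solrconfig.xml. `ramBufferSizeMB`, `autoCommit`, and `autoSoftCommit` are the standard Solr knobs; the exact values below are illustrative, not a recommendation:

```xml
<!-- solrconfig.xml: indexing/commit tuning (illustrative values) -->
<indexConfig>
  <!-- larger in-memory buffer before segments are flushed to disk -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: every 200 docs or 5 minutes, whichever comes first -->
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>300000</maxTime>          <!-- milliseconds -->
    <openSearcher>false</openSearcher> <!-- don't reopen searcher on hard commit -->
  </autoCommit>
  <!-- soft commit: make new docs searchable without a full hard commit -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>           <!-- milliseconds -->
  </autoSoftCommit>
</updateHandler>
```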

Now and then spaCy runs and CPU usage briefly hits 100%, but 90% of the time it just sits at ~20-30%.

When I open Flower I can see it working on multiple documents at once: six tasks are 'in progress', matching the number of cores.

Very confused as to where it's bottlenecking honestly.

dbsanfte commented 3 years ago

Okay, after typing all that out... I think I found the problem.

I had sentence segmentation turned on in the /search portal. I realized this while tailing /var/solr/logs/solr.log: it had constant lines mentioning '#sentence'. Now those are gone.

CPU usage looks a lot better:

(screenshot: `top` output showing much better CPU usage)

Flower is giving me about 3-5 files per second instead of 3-5 files per 10 minutes. :)
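For comparison, the same back-of-the-envelope estimate at the new rate (taking the low end, ~3 files/second):

```python
# Revised ETA at the improved rate reported by Flower.
total_files = 2_500_000
files_per_sec = 3                       # low end of the observed 3-5 files/s

eta_days = total_files / files_per_sec / 86_400   # 86,400 seconds per day
print(f"ETA: {eta_days:.1f} days")      # roughly 9.6 days
```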