opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Ability to throttle overall ETL process? #108

Closed: NetwarSystem closed this issue 2 years ago

NetwarSystem commented 4 years ago

My work involves integrating Open Semantic Search and Atlassian products in what I call the Team Investigative Environment. The systems in use today are older HP workstations with a 12-core Xeon, 96 GB of RAM, and a couple of terabytes of mirrored drives. There are multiple VirtualBox VMs on the system:

The Atlassian and Spiderfoot VMs are happy with four cores and a little bit of memory. The OSS VM uses 100% of every core I give it when loading PDFs. I have 5,300+ PDFs, about 12.5 GB total, and this is just a convenient sample size. Think court cases and merger & acquisition due diligence: we get lots of files.

I am going to use CPU affinity to control this for the moment, but it would be nice if there were a way to throttle the overall ETL process, say by limiting simultaneous jobs. This would be friendlier for those who are picking the OSS OVAs because they are less familiar with Linux management.

NetwarSystem commented 4 years ago

I couldn't get CPU affinity working from within Python after trying a variety of strategies, but modifying the opensemanticetl.service file in /etc/systemd/system did the trick:

ExecStart=/usr/bin/taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 /usr/bin/etl_tasks

The host system involved is a twelve-core Xeon, and the VM was given 22 cores in VirtualBox. Limiting the ETL service to 12 of those with taskset results in a steady 55% utilization rate. The system is working through several thousand files and had been pinned at 100% for maybe 36 hours. This is good: now it can be used for search while doing the ETL processing.
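The arithmetic behind the 55% figure, and the CPU list passed to taskset, can be sketched in a few lines of Python. This is only an illustration: the helper names `taskset_cpu_list` and `utilization_ceiling` are hypothetical and not part of Open Semantic ETL.

```python
def taskset_cpu_list(n_cpus):
    """Build the comma-separated CPU list used with `taskset -c` (hypothetical helper)."""
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    return ",".join(str(i) for i in range(n_cpus))


def utilization_ceiling(pinned, visible):
    """Upper bound on overall CPU utilization (percent) when a process
    is pinned to `pinned` of the `visible` cores."""
    return 100.0 * pinned / visible


# Rebuild the ExecStart line from the comment above for 12 pinned cores.
cpu_list = taskset_cpu_list(12)
exec_start = f"/usr/bin/taskset -c {cpu_list} /usr/bin/etl_tasks"
print(exec_start)

# 12 of the 22 cores given to the VM: roughly the observed steady-state load.
print(round(utilization_ceiling(12, 22)))
```

taskset also accepts ranges, so `-c 0-11` should be equivalent to the explicit list.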

Mandalka commented 4 years ago

There are multiple services that need to be throttled besides the Open Semantic ETL tasks:

- Tesseract (which works in parallel, so even if only one ETL task is running, multiple CPUs are used)
- Apache Tika (which starts Tesseract, too, if OCR is enabled)
- spaCy services for NER, since spaCy is multithreaded, too

I'll try to add some docs, but I'm busy for the next few days.

Approaches:

- Maximum number of parallel ETL tasks / Celery workers (which doesn't help with OCR and NER tasks, see above)
- If using Docker, limit the container's CPUs
- The Linux command "nice"
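As a rough illustration of the "nice" approach, a child process can be started at a lower scheduling priority from Python on Unix-like systems. This is a minimal sketch, not how Open Semantic ETL launches its workers; `run_niced` is a hypothetical helper, and the child command here only prints its own niceness.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run `cmd` with its niceness raised by `increment` (Unix only).

    A higher niceness means a lower scheduling priority, so other
    services on the same machine stay responsive under load.
    """
    def lower_priority():
        # Runs in the child process just before the command is executed.
        os.nice(increment)

    return subprocess.run(
        cmd,
        preexec_fn=lower_priority,
        capture_output=True,
        text=True,
        check=True,
    )


# Demonstration: the child reports its own niceness.
result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"], 5)
print("child niceness:", result.stdout.strip())
```

Note that nice only changes scheduling priority. An otherwise idle machine will still show all cores busy, so this complements a limit on parallel tasks and does not replace it.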

Mandalka commented 4 years ago

I implemented a setting for the maximum number of parallel ETL tasks via the environment variable OPEN_SEMANTIC_ETL_CONCURRENCY (it will be available in the next DEB package within the next few days). If the variable is not set, the default is the number of CPUs. The change also sets Tesseract to 1 thread per ETL task.
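The behaviour described here can be sketched roughly as follows. This is not the project's actual code: `resolve_concurrency` and `tesseract_env` are hypothetical helpers, and using the OpenMP variable `OMP_THREAD_LIMIT` to hold Tesseract to one thread is an assumption about how such a limit might be applied.

```python
import os


def resolve_concurrency(environ=os.environ, cpu_count=os.cpu_count):
    """Return the number of parallel ETL tasks.

    Uses OPEN_SEMANTIC_ETL_CONCURRENCY if it holds a positive integer,
    otherwise falls back to the number of CPUs.
    """
    raw = environ.get("OPEN_SEMANTIC_ETL_CONCURRENCY")
    if raw:
        try:
            value = int(raw)
        except ValueError:
            value = 0
        if value > 0:
            return value
    return cpu_count() or 1


def tesseract_env(environ=os.environ):
    """Copy of the environment that limits OpenMP (and so Tesseract) to one thread.

    Assumption: the Tesseract build honours OMP_THREAD_LIMIT.
    """
    env = dict(environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


print(resolve_concurrency({"OPEN_SEMANTIC_ETL_CONCURRENCY": "4"}))
```

With one Tesseract thread per task, the total number of OCR threads is bounded by the value this function returns.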

Mandalka commented 4 years ago

I disabled Tesseract multithreading in the default config, for Tika-Server by https://github.com/opensemanticsearch/tika-server.deb/commit/9d6cbb9b3773ad03fceeaabef5a63766e4271230 and for OCR of PDFs by https://github.com/opensemanticsearch/open-semantic-etl/commit/f01c07b7f36984d4190fcdf916d6bcbade92a0fe. ETL tasks still run in parallel, so multiple Tesseract processes run in parallel (one per ETL task), but an individual ETL task no longer runs multiple parallel Tesseract threads of its own. The maximum number of parallel Tesseract threads is therefore the same as the number of parallel ETL tasks, which you can fine-tune by limiting the CPUs of the Docker container or with the ETL environment variable OPEN_SEMANTIC_ETL_CONCURRENCY.

This will be available as a new package and VM tomorrow.

NetwarSystem commented 4 years ago

This is excellent news. I will test it as soon as I get a minute free.

Mandalka commented 4 years ago

The DEB packages and VMs will be available on Monday (so far the changes are only in git).

NetwarSystem commented 3 years ago

Just coming back around to this now. It's been a fairly trying year and a half here in the States. I will get this tested with the latest release. This is what I'm running now:

open-semantic-search-vm_21.01.17.ova