opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Neo4j crashed during import #137

Closed: NetwarSystem closed this issue 2 years ago

NetwarSystem commented 3 years ago

I had a batch of about 90k PDFs. About 40k are one to two pages long and needed OCR; the rest are a mixed bag. I ran this on a virtual machine with 16384 MB of RAM and 20 cores, and it took about 24 hours to do the work.

But I just noticed that about 10% of the way into the process, the Neo4j server crashed. There were out-of-memory messages on the console. I restarted the system with 24576 MB of RAM as an experiment to see whether that is enough, but it's not clear to me how I would restart the ETL process to run only the Neo4j portion.

Is that even possible? If so, how does one launch it?