Fix quadratic performance behavior of get_text()

opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database

https://opensemanticsearch.org/etl

GNU General Public License v3.0

254 stars, 69 forks

Fix quadratic performance behavior of get_text() #158

Closed. wsldankers closed this 2 years ago

wsldankers commented 2 years ago

As implemented, get_text() ran in O(n²) time, which led to very poor performance on large documents with large numbers of facets. We've seen cases where ETL processing ran for days and still hadn't finished when we aborted it.

This patch changes text concatenation to be O(n), which means it will perform reasonably well even for large documents and large numbers of facets. It also tries to reduce string copying a little, which helps with large strings.
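To illustrate the general idea, here is a minimal sketch of the two concatenation strategies. It is not the code from this patch: the function names `get_text_quadratic` and `get_text_linear`, and the assumption that the input is a dict mapping facet names to a string or a list of strings, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator, List, Union

# Hypothetical document shape: facet name -> one string or a list of strings.
Document = Dict[str, Union[str, List[str]]]


def _iter_values(data: Document) -> Iterator[str]:
    """Yield every string value of the document, flattening list-valued facets."""
    for value in data.values():
        if isinstance(value, str):
            yield value
        else:
            yield from value


def get_text_quadratic(data: Document) -> str:
    """Repeated concatenation: each step may copy everything accumulated so far."""
    text = ""
    for value in _iter_values(data):
        # Builds a new string from the old one; O(n^2) in total in the worst case.
        text = text + value + "\n"
    return text


def get_text_linear(data: Document) -> str:
    """Collect the pieces in a list and join once: O(n) in the total text length."""
    parts: List[str] = []
    for value in _iter_values(data):
        parts.append(value)
        parts.append("\n")
    # A single join allocates the result once and copies each piece once.
    return "".join(parts)


if __name__ == "__main__":
    doc: Document = {"title": "Report", "persons": ["Alice", "Bob"]}
    assert get_text_quadratic(doc) == get_text_linear(doc)
    print(get_text_linear(doc))
```

Both functions return the same string; only the cost differs. CPython can sometimes extend a string in place during repeated concatenation, so the quadratic cost does not always show up in practice, but that is an implementation detail and the join-based form does not depend on it.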