opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0
254 stars 69 forks

Fix quadratic performance behavior of enhance_extract_money #159

Closed wsldankers closed 2 years ago

wsldankers commented 2 years ago

As implemented, enhance_extract_money had O(n²) performance, which led to very slow processing of large documents containing many money figures. We've seen cases where processing took hours.

The problem was its use of etl_plugin_core.append() for every money amount found: each call deduplicates the list of facet values, so it takes longer and longer with each added value.

This patch changes it to collect all values in a set and then add them in bulk at the end.
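To illustrate the difference, here is a minimal sketch, not the actual patch. `append_dedup` is a hypothetical stand-in for a deduplicating append helper (it does not reproduce the real etl_plugin_core.append() signature), and the `money_ss` facet name and the regex are assumptions made up for the example.

```python
import re

# Illustrative pattern only; the plugin's real money regex is not shown here.
MONEY_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*")


def append_dedup(data, facet, value):
    """Hypothetical stand-in for a deduplicating append helper."""
    values = data.setdefault(facet, [])
    # Membership test on a list is a linear scan, so each call costs O(len(values)).
    if value not in values:
        values.append(value)


def extract_money_slow(text, data, facet="money_ss"):
    """One deduplicating append per match: O(n²) overall in the worst case."""
    for match in MONEY_RE.finditer(text):
        append_dedup(data, facet, match.group(0))


def extract_money_fast(text, data, facet="money_ss"):
    """Collect matches in a set first, then add them in bulk at the end."""
    found = {match.group(0) for match in MONEY_RE.finditer(text)}
    existing = data.setdefault(facet, [])
    already_present = set(existing)  # built once, so lookups are O(1) on average
    existing.extend(sorted(found - already_present))
```

Both variants end up with the same set of values; only the order within the facet list may differ, since the bulk version sorts the new values rather than keeping first-seen order.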