Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with an ingestor to a Solr or Elasticsearch index and a linked data graph database
As implemented, enhance_extract_money ran in O(n²) time in the number of money amounts found, which led to very poor performance on large documents containing many money figures. We've seen cases where processing took hours.
The cause was its call to etl_plugin_core.append() for every money amount found. That function deduplicates the list of facet values on each call, so every call takes longer than the one before as the list grows.
This patch changes it to collect all values in a set and then add them in bulk at the end.
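To illustrate the difference, here is a minimal sketch of the two approaches. It is not the code from this patch: append_dedup is a hypothetical stand-in that only assumes the deduplicating behavior described above, and the facet name "money_ss" and the sample values are made up for the example.

```python
def append_dedup(data, facet, value):
    """Hypothetical stand-in for a deduplicating append (not the real API)."""
    values = data.setdefault(facet, [])
    # Membership test on a list scans every existing entry: O(len(values)).
    if value not in values:
        values.append(value)


def add_one_by_one(data, facet, found):
    """Old pattern: one deduplicating append per match, O(n^2) overall."""
    for value in found:
        append_dedup(data, facet, value)


def add_in_bulk(data, facet, found):
    """New pattern: deduplicate in a set first, then extend the list once."""
    unique = set(found)  # average O(1) per insertion
    existing = data.setdefault(facet, [])
    already_present = set(existing)
    # Sorted only to make the resulting order deterministic.
    existing.extend(sorted(v for v in unique if v not in already_present))


# Made-up sample matches, including a duplicate.
found = ["$5", "10 EUR", "$5", "3 USD"]

old_style = {}
add_one_by_one(old_style, "money_ss", found)

new_style = {}
add_in_bulk(new_style, "money_ss", found)
```

Both variants end up with the same unique values. The bulk variant does not preserve the order in which the amounts were found, which is assumed to be acceptable for facet values.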