Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database
The way get_text() was implemented, it had O(n²) performance, which led to very poor performance on large documents with large numbers of facets. We've seen cases where ETL processing took days and still wasn't finished when we aborted it.
This patch changes text concatenation to be O(n), which means it will perform reasonably well even for large documents and large numbers of facets. It also tries to reduce string copying a little, which helps with large strings.
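
For illustration, here is a minimal sketch of the general pattern, not the actual get_text() code from this patch. The function names and the shape of `facets` (a dict mapping facet names to lists of values) are assumptions made for the example. Building a string with repeated concatenation can copy everything accumulated so far on each step, whereas collecting the pieces in a list and joining once copies each piece a single time.

```python
def get_text_quadratic(facets):
    """Hypothetical 'before' pattern: repeated concatenation.

    Each `text + ...` builds a new string, so the text accumulated so far
    may be copied again on every iteration (O(n^2) in the worst case).
    """
    text = ""
    for values in facets.values():
        for value in values:
            text = text + str(value) + "\n"
    return text


def get_text_linear(facets):
    """Hypothetical 'after' pattern: collect parts, join once (O(n))."""
    parts = []
    for values in facets.values():
        for value in values:
            parts.append(str(value))
            parts.append("\n")
    # A single join allocates the result once and copies each part once.
    return "".join(parts)


if __name__ == "__main__":
    # Made-up sample data, only to show both variants produce the same text.
    sample = {"title_txt": ["Report"], "content_txt": ["first page", "second page"]}
    assert get_text_linear(sample) == get_text_quadratic(sample)
    print(get_text_linear(sample))
```

Both variants return the same text; only the amount of copying differs, which is what matters for documents with many large facet values.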