opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

SOLR index size #75

Closed YoannMR closed 5 years ago

YoannMR commented 5 years ago

Hi,

I recently moved to the latest version of OSS and noticed that the SOLR index is significantly bigger (at least 2x larger).

I compared the fields in the two versions: while there are many more fields in the new version, I noticed that the content of the document is duplicated 4 times (in "text", "text_txt_en", "content_txt" and "content_txt_txt_en").
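A field comparison like the one described above can be sketched in a few lines of Python. This is only an illustration, assuming a single document has already been fetched from Solr as a plain dict of field name to value; the helper `find_duplicated_fields` and the sample document are hypothetical and not part of Open Semantic ETL.

```python
from collections import defaultdict


def find_duplicated_fields(doc):
    """Return groups of field names that hold identical values."""
    groups = defaultdict(list)
    for name, value in doc.items():
        # Multi-valued Solr fields arrive as lists; convert to a tuple
        # so the value can be used as a dictionary key.
        key = tuple(value) if isinstance(value, list) else (value,)
        groups[key].append(name)
    # Keep only values that appear under more than one field name.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical document mimicking the duplication reported in this issue.
doc = {
    "id": "/docs/example.pdf",
    "text": ["some body text"],
    "text_txt_en": ["some body text"],
    "content_txt": ["some body text"],
    "content_txt_txt_en": ["some body text"],
}

duplicates = find_duplicated_fields(doc)
print(duplicates)
```

Run against a real document, each returned group names a set of fields that hold the same value, which is what makes the index grow.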

Is there a need for this duplication? The previous version only had 'content' with the document text.

Thanks for your help!

opensemanticsearch commented 5 years ago

The stemmed variant of the default catch-all field is now stored for the thesaurus recommender, so it can also be used for highlighting non-stemmed fields like content_txt. There is therefore no longer any need for stemmed content fields to highlight stemmed variants in the search UI. The newest releases use only a second, language-specific default/catch-all field (for example text_txt_en) for stemming, and the ETL no longer copies every other field to a stemmed variant.