kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0

SolrCellBuilder appears to fail to configure Tika to extract embedded contents #397

Open tballison opened 9 years ago

tballison commented 9 years ago

To configure Tika to parse embedded documents recursively, you need to set the embedded parser in the `ParseContext` that is passed to `parse()`. If my reading of `SolrCellBuilder` is correct, it does not do this, so Tika will only pull the contents out of the container document and will miss attachments and other embedded files.
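For reference, here is a minimal sketch of the configuration I mean, written against plain Tika rather than against `SolrCellBuilder` itself. The class name `RecursiveTikaExtract` and the method `extractText` are made up for illustration, and I am assuming `tika-core` and `tika-parsers` are on the classpath. The relevant line is the `context.set(Parser.class, parser)` call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Kite or Solr.
public class RecursiveTikaExtract {

    // Extracts text from a file, including text from embedded documents.
    public static String extractText(Path file)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();

        // Registering the parser under Parser.class is what lets Tika
        // recurse into attachments and other embedded documents.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // -1 disables the default write limit on the handler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, context);
        }
        return handler.toString();
    }
}
```

With the `context.set(...)` line removed, the same call should return only the container's own text, which is the behavior I suspect in `SolrCellBuilder`.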

See: https://issues.apache.org/jira/browse/SOLR-7189 and http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3CCAN4YXve24W++MKK1U-n0rp6JKNf-FQB10_ggRw4W4-Xy8dgP-w@mail.gmail.com%3E