mattporritt / moodle-search_elastic

An Elasticsearch engine plugin for Moodle's Global Search
https://moodle.org/plugins/search_elastic
GNU General Public License v3.0

Support for Tika as an Elasticsearch plugin? #22

Closed abias closed 7 years ago

abias commented 7 years ago

Hi,

I am currently looking into search_elastic to run it as an alternative to search_solr, mainly because Elasticsearch seems to be easier to install and run on our RHEL 7 systems.

I have seen that you recommend running Tika for file indexing as a standalone application (see https://github.com/catalyst/moodle-search_elastic#tika-setup). However, there are no RPM packages for Tika as far as I can see, and fiddling with a manually configured service for Tika can be daunting.
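Just to illustrate what I mean by a manually configured service: as far as I understand it, running Tika standalone boils down to launching the tika-server jar and wrapping it in something like the systemd unit below (the jar path, version and user are only placeholders I made up):

```ini
# /etc/systemd/system/tika.service -- illustrative only, paths and version are placeholders
[Unit]
Description=Apache Tika server
After=network.target

[Service]
ExecStart=/usr/bin/java -jar /opt/tika/tika-server-1.16.jar --host 127.0.0.1 --port 9998
Restart=on-failure
User=tika

[Install]
WantedBy=multi-user.target
```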

On the other hand, you also write that there are Elasticsearch plugins for Tika. I have found https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html and would like to ask:

Thanks, Alex

mattporritt commented 7 years ago

Yes, the link you provided is the correct one for the Elasticsearch ingest plugin. No, the current plugin will not work with it: it expects Tika to be running as a standalone service.

Internally we use Tika via a Docker container. This makes it easy to scale and share resources, and it is also easy to set up and run for test and dev. We don't use RHEL internally, so I can't help you with an RPM.
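As a rough sketch (the image name here is just an example, not necessarily what we run internally), that boils down to something like:

```bash
# Start a Tika server container listening on the default port 9998
docker run -d --name tika -p 9998:9998 apache/tika:latest

# Quick sanity check: extract text from a local file via the Tika REST endpoint
curl -T sample.pdf http://localhost:9998/tika
```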

I'll add this to the backlog as a feature request. If you are interested in sponsoring it please let me know.

brendanheywood commented 7 years ago

Just throwing a random idea around that I briefly mentioned to @mattporritt. The file indexing component could be abstracted as a subplugin of the search plugin, so that you can select which indexer(s) to use and in what order, and we could refactor Tika into one and the elastic indexer into another.

It occurs to me that we could also leverage the file converter api to do something similar, so you could configure, say, unoconv to do the text conversion from PDF to txt, and then you could write a file converter for Tika which could also do the AI image processing etc. The one downside of this is that the search api landed in 3.1 while the file converter api only landed in 3.3. So perhaps search could just use both: try whatever subplugins it has that are turned on, and if no subplugin handles that file type but there is an enabled converter, use the converter.

But maybe this is just over thinking it all :)

mattporritt commented 7 years ago

I need to do some more research here. Sub plugins for search may require a core patch, but I need to confirm.

I think overall we need to go to a sub plugin architecture here. When we do, it will be a trade-off between technical debt and effort now; I haven't decided which way to go yet.

I think a good approach here is to have a range of plugins that declare to the parent which file MIME types they support. Then file indexing is handled by the appropriate enabled sub plugin.

Eventually I would like to see the following sub plugins as a start:

I might change my mind when I look into it more, but this is my current thinking.

mattporritt commented 7 years ago

I've scratched sub plugins as an idea. Instead, as has been suggested, I'm going to implement support for the document converter API in Moodle. I've created issue #25 for this.

As 3.1 is an LTS release and 3.2 still has about a year of support left, I'm going to implement this functionality using the current plugin architecture. Eventually there will be a pre-3.3 version of this plugin that does things using the current architecture and a 3.3-and-above version that uses the file converter API.

mattporritt commented 7 years ago

I did some more research into the Elasticsearch ingest plugin and how ingesting files with Elasticsearch works. I had assumed, based on my initial investigation, that you simply provided the file as a base64 encoded string along with the rest of the document data to be added to the index. This is not the case.

Implementing support for this method of file indexing would mean changes to the way the initial index is created, how documents are stored and how results are returned. This is a level of change that I am not willing to undertake.

Files stored using the ingest plugin are stored as separate records inside Elasticsearch, separate from the rest of the data relating to that file. This would mean I would need to figure out a way to link the metadata about a file (such as the activity it relates to) to the actual file data, all while not breaking existing functionality.

Also, ingesting files this way seems very inefficient. With the ingest plugin, files are stored in Elasticsearch's internal database as base64 encoded strings, and base64 takes up roughly a third more space than the original binary. This is in addition to the content extracted from the file, which is also stored in Elasticsearch. The current implementation, with Tika as an external service, means you are not making an additional copy of the file.
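For reference, the documented usage of the ingest attachment processor is roughly the following (the index and field names are illustrative, and the request format differs slightly between Elasticsearch versions):

```bash
# Create an ingest pipeline that runs the attachment processor on the "data" field
curl -X PUT 'http://localhost:9200/_ingest/pipeline/attachment' \
  -H 'Content-Type: application/json' -d '{
  "description": "Extract file content from base64-encoded data",
  "processors": [
    { "attachment": { "field": "data" } }
  ]
}'

# Index a document through that pipeline; "data" must carry the whole file, base64 encoded
curl -X PUT 'http://localhost:9200/my-index/_doc/1?pipeline=attachment' \
  -H 'Content-Type: application/json' -d '{
  "data": "'"$(base64 -w 0 sample.pdf)"'"
}'
```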

The Elasticsearch documentation also states: “Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node.” Adding a dedicated node for ingest would make the Elasticsearch setup process more complicated.
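For what it is worth, a dedicated ingest node is configured along these lines in elasticsearch.yml (settings as they are in the 5.x/6.x series; later releases changed how node roles are declared):

```yaml
# elasticsearch.yml on the dedicated ingest node: no master or data duties, ingest only
node.master: false
node.data: false
node.ingest: true
```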

The architecture changes required here are too great. I will still work towards modifying my Moodle plugin to use the File conversion API as this still seems like a good way forward.