Enable Titan to use configured ElasticSearch tokenizers and filters.

BillBaird commented 10 years ago

Titan exposes a subset of ElasticSearch features. ElasticSearch allows customized tokenizers and filters. Proper use of es to index Titan property keys would allow Titan to take advantage of these powerful behaviors. Examples would be a native es tokenizer like the pathhierarchy-tokenizer http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer/ or a pluggable tokenizer like a phonetic tokenizer to enable phonetic searches http://blog.jessitron.com/2012/04/configuring-soundex-in-elasticsearch.html https://github.com/elasticsearch/elasticsearch-analysis-phonetic
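
To illustrate what a path-hierarchy style tokenizer produces, here is a minimal sketch in plain Java. It is not ElasticSearch code; the class and method names are hypothetical, and it only approximates the default behavior for simple paths (no handling of repeated delimiters, skipping, or replacement characters).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the idea of a path-hierarchy tokenizer.
final class PathHierarchySketch {

    // Emits one token per path prefix, ending at each delimiter and at the end of the input.
    static List<String> tokenize(String path, char delimiter) {
        List<String> tokens = new ArrayList<>();
        // Start at index 1 so a leading delimiter does not yield an empty token.
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) == delimiter) {
                tokens.add(path.substring(0, i));
            }
        }
        if (!path.isEmpty()) {
            tokens.add(path); // the full path is always the last token
        }
        return tokens;
    }
}
```

With this sketch, `PathHierarchySketch.tokenize("/something/something/else", '/')` yields `/something`, `/something/something`, and `/something/something/else`, so a query on a parent path can match every property value stored beneath it.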

Phonetic search allows requests to return results despite misspellings, and enables "did-you-mean" types of searches. Soon, es will have completion suggesters.
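
As a rough illustration of why phonetic matching tolerates misspellings, the following is a minimal sketch of American Soundex in plain Java. It is not the elasticsearch-analysis-phonetic plugin; the class name is hypothetical, and only ASCII letters are considered.

```java
import java.util.Locale;

// Hypothetical illustration only: a compact American Soundex encoder.
final class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that carry no code (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                prev = code;
            } else {
                if (code != '0' && code != prev) {
                    out.append(code); // new consonant sound
                }
                if (raw != 'H' && raw != 'W') {
                    prev = code; // H and W do not separate equal codes; vowels do
                }
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Names that sound alike map to the same code: both `encode("Smith")` and `encode("Smyth")` give `S530`, so an index keyed on the code would match either spelling.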
http://www.elasticsearch.org/blog/you-complete-me/

With current es integration, a Titan graph is unable to natively take advantage of these powerful capabilities.

As a suggested approach, Blueprints allows for passing additional parameters to createKeyIndex: https://github.com/tinkerpop/blueprints/blob/master/blueprints-core/src/main/java/com/tinkerpop/blueprints/KeyIndexableGraph.java

A similar approach would be to extend TypeMaker's .indexed to be .indexed(String indexName,Class type,Parameter... parameters) where the es tokenizers and filters could be configured.
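
A rough sketch of how such varargs parameters might be folded into per-field index settings is shown below. Everything here is a hypothetical stand-in written in plain Java: `Param` is not Titan's `Parameter` class, `fieldSettings` is not an existing Titan or ElasticSearch method, and the resulting key names are assumptions about what a backend could forward to es.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: none of these types exist in Titan.
final class IndexParameterSketch {

    // Stand-in for a key/value index parameter.
    static final class Param {
        final String key;
        final Object value;

        private Param(String key, Object value) {
            this.key = key;
            this.value = value;
        }

        static Param of(String key, Object value) {
            return new Param(key, value);
        }
    }

    // Collects the varargs into the settings map a backend could pass on for one field.
    static Map<String, Object> fieldSettings(String dataType, Param... parameters) {
        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("type", dataType);
        for (Param p : parameters) {
            settings.put(p.key, p.value); // later parameters override earlier ones
        }
        return settings;
    }
}
```

For example, `fieldSettings("string", Param.of("analyzer", "cjk"))` produces a map with `type=string` and `analyzer=cjk`. Keeping the parameters as opaque key/value pairs means the graph API would not need to know about any particular tokenizer or filter.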

Pluggable tokenizers would have to be installed first. It would be nice if there were a way of accomplishing this through es configuration, perhaps through a storage.index.search.plugin property. This would be best accomplished in conjunction with issue #343.

espeed commented 10 years ago

Since Titan is already using ZooKeeper, wouldn't it be simpler to separate the ElasticSearch code, and instead provide an output interface to Kafka?

See the ElasticSearch Kafka-River Plugin: https://github.com/endgameinc/elasticsearch-river-kafka

Adding a Kafka out would make it easy for people to configure ElasticSearch as needed, and it would allow Titan to feed into multiple backends such as Solr, ElasticSearch, or any other backend system, without having to write custom connectors for each.
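
The shape of that idea can be sketched with an in-memory stand-in. This is plain Java, not a Kafka client, and the class and method names are hypothetical; it only shows one ordered stream of mutations being delivered to several independent backends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only: an in-memory substitute for a Kafka topic.
final class MutationLogSketch {

    private final List<String> log = new ArrayList<>();
    private final List<Consumer<String>> backends = new ArrayList<>();

    // Registers a backend (for example an ElasticSearch or Solr feeder).
    void subscribe(Consumer<String> backend) {
        backends.add(backend);
    }

    // Appends the mutation to the log, then delivers it to every backend in order.
    void publish(String mutation) {
        log.add(mutation);
        for (Consumer<String> backend : backends) {
            backend.accept(mutation);
        }
    }

    // Number of mutations published so far.
    int size() {
        return log.size();
    }
}
```

Each backend sees the same mutations in the same order, which is the property that would let separate indexing systems be configured independently of the graph database.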

Kafka is fast, durable, ordered, and 0.8 is replicated. It's commonly used to feed Storm and Spark so integrating Kafka with Titan would provide a generic way to provide pre-processing and post-processing from/to other systems.

Storm Kafka
Spark: Attaching Input Sources - InputDStreams -- BTW: Ion Stoica and Matei Zaharia's new company Databricks just received $14M in funding to build out the Spark platform.

Example Kafka dataflow diagram... Source: http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/

List of Kafka clients... https://cwiki.apache.org/confluence/display/KAFKA/Clients

Overview of Kafka's binary protocol... https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

mbroecheler commented 10 years ago

As discussed in https://groups.google.com/forum/#!topic/aureliusgraphs/VGv-RJwt8zI we have added the ability to pass arbitrary Parameters through to the indexing backend.

kevinschumacher commented 10 years ago

Colleagues of mine tackled this solely for ElasticSearch by modifying ElasticSearchIndex.java to support an additional parameter on index creation. This additional parameter gets passed directly to ES. We are able to use it like so:

g.makeKey("name")
    .dataType(String.class)
    .indexed("search", Vertex.class, com.thinkaurelius.titan.core.Parameter.of("tokenizer", "(\\.|\\s|\\@|\\/)+"))
    .make()

This makes ES tokenize on periods, whitespace, @, and / instead of using its default tokenizer.
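
The effect of that pattern can be checked outside of ES with plain `java.util.regex`. The sketch below is only an approximation of pattern-based tokenization, not the ES implementation, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: splits text on the separator pattern from the comment above.
final class PatternTokenizerSketch {

    // One or more periods, whitespace characters, '@' or '/' form a single separator.
    static final Pattern SEPARATORS = Pattern.compile("(\\.|\\s|\\@|\\/)+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : SEPARATORS.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece); // drop the empty piece produced by a leading separator
            }
        }
        return tokens;
    }
}
```

For the input `john.doe@example.com /home/john` this yields the tokens `john`, `doe`, `example`, `com`, `home`, `john`.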

@mbroecheler @dalaro is this something you would consider for a pull request? We sort of hacked it together but can put in a little effort to improve it with your guidance.

mbroecheler commented 10 years ago

This is a good use of the new Parameter arguments for index registration. If you can put this into a generally usable pull request with test coverage, we would greatly appreciate it.

bezalel commented 10 years ago

g.makeKey("my_cjk_search_field").dataType(String.class).indexed("search", Vertex.class, com.thinkaurelius.titan.core.Parameter.of("analyzer", "cjk")).make();

==> This does not work for me. I want to set a different (custom) analyzer for each property. Any update? This is really important for me and my customer :D

kevinschumacher commented 10 years ago

Unfortunately I never got around to cleaning up the code and submitting the pull request. As far as I know that functionality doesn't exist in Titan yet (at least not in the 0.4.x series)

bezalel commented 10 years ago

@kevinschumacher Could you please share the Titan jar file for Titan 0.4.4? :D My email is bezalel.dev@gmail.com

thinkaurelius / titan

Enable Titan to use configured ElasticSearch tokenizers and filters. #399