thinkaurelius / titan

Distributed Graph Database
http://titandb.io
Apache License 2.0
Using SparkGraphComputer to traverse a titan cluster throws an error #1339

Open sheldonkhall opened 8 years ago

sheldonkhall commented 8 years ago

I have a cluster running tinkerpop-3.1.1, titan-1.1.0-SNAPSHOT, spark-1.5.2 and hadoop-2.7.1. Running the following script in the Gremlin console reproduces the error:

graph = GraphFactory.open("hadoop-gryo.properties")
graph.traversal().V().count()
graph.traversal(computer(SparkGraphComputer)).V().next()
graph = GraphFactory.open("titan-cassandra-test-spark.properties")
graph.traversal().V().count()
graph.traversal(computer(SparkGraphComputer)).V().next()

The last call produces this error:

You must set the initial output address to a Cassandra node with setInputInitialAddress
Display stack trace? [yN] y
java.lang.IllegalStateException: You must set the initial output address to a Cassandra node with setInputInitialAddress
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopElementIterator.<init>(HadoopElementIterator.java:71)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopVertexIterator.<init>(HadoopVertexIterator.java:36)
    at org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph.vertices(HadoopGraph.java:263)
    at org.apache.tinkerpop.gremlin.process.traversal.step.map.GraphStep.lambda$new$379(GraphStep.java:61)
    at org.apache.tinkerpop.gremlin.process.traversal.step.map.GraphStep.processNextStart(GraphStep.java:123)
    at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:126)
    at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:37)
    at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:157)
    at java_util_Iterator$next.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:117)
    at groovysh_evaluate.run(groovysh_evaluate:3)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.codehaus.groovy.tools.shell.Interpreter.evaluate(Interpreter.groovy:70)
    at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:187)
    at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:122)
    at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:95)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1210)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:124)
    at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:59)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1210)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:83)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:144)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:305)
Caused by: java.lang.UnsupportedOperationException: You must set the initial output address to a Cassandra node with setInputInitialAddress
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.validateConfiguration(AbstractColumnFamilyInputFormat.java:84)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.validateConfiguration(ColumnFamilyInputFormat.java:74)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:122)
    at com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraBinaryInputFormat.getSplits(CassandraBinaryInputFormat.java:48)
    at com.thinkaurelius.titan.hadoop.formats.util.GiraphInputFormat.getSplits(GiraphInputFormat.java:48)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopElementIterator.<init>(HadoopElementIterator.java:66)
    ... 44 more

Strangely, the hadoop-gryo.properties graph (which is admittedly local to the machine I execute on) can perform the required traversals. The error occurs only when I execute any traversal other than count on a Hadoop graph pointing to a Titan cluster (the config is attached below). Is this a bug, or am I missing a setting?

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
#gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=/test/output
####################################
# Cassandra Cluster Config         #
####################################
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.cassandra.keyspace=mindmapstest
titanmr.ioformat.conf.storage.hostname=lxd-cluster2-cassandra1,lxd-cluster2-cassandra2,lxd-cluster2-cassandra3
titanmr.ioformat.cf-name=edgestore
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=spark://lxd-cluster2-cassandra1:7077
#spark.master=local[6]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
#spark.eventLog.enabled=true
####################################
# Apache Cassandra InputFormat configuration
####################################
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=mindmapstest
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
cassandra.thrift.framed.size_mb=1024
####################################
# Hadoop Cluster configuration     #
####################################
fs.defaultFS=hdfs://lxd-cluster2-cassandra1:9000
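
The root cause in the trace above is raised from `AbstractColumnFamilyInputFormat.validateConfiguration`, which complains that no initial Cassandra address is set. A minimal sketch for checking which `cassandra.input.*` keys a properties file like the one above actually defines follows. The key names `cassandra.input.thrift.address` and `cassandra.input.thrift.port` are assumptions based on Cassandra's `ConfigHelper` and are not confirmed anywhere in this thread; the helper names `parse_properties` and `missing_keys` are hypothetical.

```python
from typing import Dict, List

# ASSUMPTION: these are the keys Cassandra's Hadoop ConfigHelper reads.
# The two thrift.* names are not confirmed by this thread.
ASSUMED_INPUT_KEYS = [
    "cassandra.input.thrift.address",
    "cassandra.input.thrift.port",
    "cassandra.input.partitioner.class",
    "cassandra.input.keyspace",
    "cassandra.input.columnfamily",
]


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        # Split on the first '=' only, so values such as URLs stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: Dict[str, str],
                 required: List[str] = ASSUMED_INPUT_KEYS) -> List[str]:
    """Return the required keys that the properties do not define."""
    return [key for key in required if key not in props]


# Excerpt of the configuration posted in this issue.
SAMPLE = """
titanmr.ioformat.conf.storage.hostname=lxd-cluster2-cassandra1,lxd-cluster2-cassandra2,lxd-cluster2-cassandra3
spark.master=spark://lxd-cluster2-cassandra1:7077
#spark.master=local[6]
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=mindmapstest
cassandra.input.columnfamily=edgestore
"""

if __name__ == "__main__":
    # Prints ['cassandra.input.thrift.address', 'cassandra.input.thrift.port']
    print(missing_keys(parse_properties(SAMPLE)))
```

A key missing from the file does not by itself prove a misconfiguration: the input format may be expected to derive the address from `titanmr.ioformat.conf.storage.hostname` at runtime, in which case the failure would lie in how the format is initialised rather than in the file.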

pluradj commented 8 years ago

There is additional discussion in the comments on http://stackoverflow.com/questions/38787338/using-sparkgraphcomputer-to-traverse-a-titan-cluster-throws-an-error