imri / mizo

Super-fast Spark RDD for Titan Graph Database on HBase
Apache License 2.0
25 stars 10 forks

How can I get Titan vertices from HBase directly using Apache Spark? #1

Open ChaohsinChan opened 7 years ago

ChaohsinChan commented 7 years ago

I am running Titan 1.0 with an HBase 1.0.3 backend. I want to get the Titan vertices from HBase directly using Apache Spark 1.6.1. Can you give me some advice? Thanks.

imri commented 7 years ago

Hey,

You can run the following code to retrieve the vertices. For example, let's count how many vertices you have in your graph.

import mizo.rdd.MizoBuilder;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class MizoVerticesCounter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Mizo Vertices Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .titanConfigPath("titan-graph.properties")
                .regionDirectoriesPath("hdfs://my-graph/*/e") // HDFS path to your HBase Table
                .parseInEdges(v -> false)
                .verticesRDD(sc)
                .toJavaRDD()
                .count(); // total number of vertices in your graph

        System.out.println("Vertices count is: " + count);
    }
}

Change 'hdfs://my-graph/*/e' to the HDFS path of your HBase Table.
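If you only know the table name, the region directories can usually be located by listing HBase's data directory on HDFS. A hedged sketch, assuming the layout seen elsewhere in this thread (the /apps/hbase/data prefix is distribution-specific, e.g. HDP; check hbase.rootdir in your hbase-site.xml, and the table name "titandb" is just an example):

```shell
# List the region directories of a table named "titandb"; each region
# contains an "e" (edgestore) column-family directory that Mizo reads.
hdfs dfs -ls /apps/hbase/data/data/default/titandb

# The regionDirectoriesPath for Mizo would then look something like:
#   hdfs://namenode:8020/apps/hbase/data/data/default/titandb/*/e
```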

Let me know if you have any further questions.

ChaohsinChan commented 7 years ago

Thank you for your reply. I have two suggestions. First, could the HDFS path be obtained through the HBase interface? That would be more convenient, since usually we only know the HBase table name and its configuration. Second, could the project be converted to Maven? That would also allow development inside Eclipse; for those not familiar with IDEA, it takes a long time to set up the development environment.

imri commented 7 years ago

Thanks for your suggestions -

Regarding the table name, I generally prefer not to rely on Hadoop config files, but rather to specify paths directly.

Regarding Maven - good advice, I will switch to Maven and reupload soon.

Did you manage to run the code eventually?

ChaohsinChan commented 7 years ago

I am not very familiar with IDEA, so I have not yet managed to set up a working development environment. Can you give me some advice?

imri commented 7 years ago

You only have to open the root directory in IntelliJ, then go to MizoEdgesCounter, right-click, and choose Debug.

ChaohsinChan commented 7 years ago

When I import the project into IDEA, choosing "Create project from existing sources" prompts me that the project file already exists, and other errors occur when I choose to overwrite it. I do not know why. But if I choose "Import project from external model", only Eclipse, Gradle, and Maven are available. So I still did not succeed.

imri commented 7 years ago

Use Open rather than Import Project; that should work.


imri commented 7 years ago

Try using File > Open and choose the project's .iml file.

ChaohsinChan commented 7 years ago

Thank you for your suggestion. I am left with one last problem:

Module mizo-core: invalid item 'com.google.guava:guava:19.0' in the dependencies list
Module mizo-core: invalid item 'com.thinkaurelius.titan:titan-core:1.0.0' in the dependencies list

How do I introduce these dependencies? HBase and Spark do not have these dependency problems.

imri commented 7 years ago

These dependencies should come from Maven. I see that the POMs are not included in the repo; I will add them within 12 hours.

ChaohsinChan commented 7 years ago

OK, thanks. I found that the files titan-graph.properties and log4j.properties are also missing; could you add them as well?

imri commented 7 years ago

You can omit the log4j properties file, and titan-graph.properties is your Titan properties file.


ChaohsinChan commented 7 years ago

I found an error:

Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager

I suspect this problem is related to the titan-graph.properties config. Can you show me your config?

imri commented 7 years ago

Send me your properties file.


ChaohsinChan commented 7 years ago

storage.backend=hbase
storage.hostname=hlg-3p163-wangyongzhi,hlg-3p190-wangyongzhi,hlg-3p166-wangyongzhi
storage.hbase.table=titandb
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true

ChaohsinChan commented 7 years ago

I wonder whether this configuration is right. I just copied it from the Titan configuration.

imri commented 7 years ago

Add:

storage.hbase.compat-class = com.thinkaurelius.titan.diskstorage.hbase.HBaseCompat1_0
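For completeness, the user's configuration with this line added would read as follows. This is a sketch assuming Titan 1.0 against HBase 1.0; the hostnames, table name, and znode parent are taken from the config posted earlier in this thread:

```properties
storage.backend=hbase
storage.hostname=hlg-3p163-wangyongzhi,hlg-3p190-wangyongzhi,hlg-3p166-wangyongzhi
storage.hbase.table=titandb
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
# Tells Titan which HBase compatibility layer to instantiate (HBase 1.0 here)
storage.hbase.compat-class=com.thinkaurelius.titan.diskstorage.hbase.HBaseCompat1_0
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true
```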

ChaohsinChan commented 7 years ago

It does not work. Do I need other dependencies?

imri commented 7 years ago

Let me build it myself and I will upload it as a complete Maven project. Will update you soon.

ChaohsinChan commented 7 years ago

OK, thanks.

ChaohsinChan commented 7 years ago

I solved all the problems myself, and now I am at the last step, but there was an error:

Exception in thread "main" java.lang.ClassCastException: com.thinkaurelius.titan.graphdb.types.VertexLabelVertex cannot be cast to com.thinkaurelius.titan.graphdb.internal.InternalRelationType
    at mizo.rdd.MizoRDD.lambda$loadRelationTypes$3(MizoRDD.java:146)
    at java.lang.Iterable.forEach(Iterable.java:75)

Could you give me some advice?

imri commented 7 years ago

Please send me your code.


ChaohsinChan commented 7 years ago

public class MizoEdgesCounter {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\F盘\\hadoop-2.6.0.tar\\hadoop-2.6.0\\hadoop-2.6.0");

        SparkConf conf = new SparkConf()
                .setAppName("Mizo Edges Counter")
                .setMaster("local[1]")
                .set("spark.executor.memory", "4g")
                .set("spark.executor.cores", "1")
                .set("spark.rpc.askTimeout", "1000000")
                .set("spark.rpc.frameSize", "1000000")
                .set("spark.network.timeout", "1000000")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "6000")
                .set("spark.driver.maxResultSize", "100m")
                .set("spark.task.maxFailures", "20")
                .set("spark.shuffle.io.maxRetries", "20");

        SparkContext sc = new SparkContext(conf);

        long count = new MizoBuilder()
                .logConfigPath("C:\\ideapluin\\mizo-master\\mizo-master\\target\\test\\mizo-rdd\\log4j.properties")
                .titanConfigPath("C:\\ideapluin\\mizo-master\\mizo-master\\target\\test\\mizo-rdd\\titan-graph.properties")
                .regionDirectoriesPath("hdfs://hlg-3p163-wangyongzhi:8020/apps/hbase/data/data/default/titandb6/8f68e1d6f9d35a4683e1a4c264cd669f/e")
                .parseInEdges(v -> false)
                .edgesRDD(sc)
                .toJavaRDD()
                .count();

        System.out.println("Edges count is: " + count);
    }
}

ChaohsinChan commented 7 years ago

I did not modify your code. The error occurred here:

protected static HashMap<Long, InternalRelationType> loadRelationTypes(String titanConfigPath) {
    TitanGraph g = TitanFactory.open(titanConfigPath);
    StandardTitanTx tx = (StandardTitanTx) g.newTransaction();

    HashMap<Long, InternalRelationType> relations = Maps.newHashMap();

    tx.query()
            .has(BaseKey.SchemaCategory, Contain.IN, Lists.newArrayList(TitanSchemaCategory.values()))
            .vertices()
            .forEach(v -> relations.put(v.longId(), new MizoTitanRelationType((InternalRelationType) v)));

    g.close();

    return relations;
}

imri commented 7 years ago

In MizoRDD's loadRelationTypes, change the forEach to:

.forEach(v -> {
    if (v instanceof InternalRelationType) {
        relations.put(...);
    }
});


imri commented 7 years ago

Modify the code as I mentioned; it should solve this problem.


ChaohsinChan commented 7 years ago

The problem above was solved, but there was also an error:

java.lang.IllegalArgumentException: Invalid ASCII encoding offset: 625
    at com.thinkaurelius.titan.graphdb.database.serialize.attribute.StringSerializer.read(StringSerializer.java:105)
    at mizo.hbase.MizoTitanHBaseRelationParser.readPropertyValue(MizoTitanHBaseRelationParser.java:179)
    at mizo.iterators.MizoBaseRelationsIterator.handleProperty(MizoBaseRelationsIterator.java:87)
    at mizo.iterators.MizoBaseRelationsIterator.getEdgeOrNull(MizoBaseRelationsIterator.java:46)

imri commented 7 years ago

Ok I will check it later today.

In short: Mizo was never tested on a graph with vertex labels, so that's probably the issue.

Can you describe your Titan schema? Which edges do you have, their types, etc.?


ChaohsinChan commented 7 years ago

I use the Titan example graph, the Graph of the Gods; you can see it here: http://s3.thinkaurelius.com/docs/titan/1.0.0/getting-started.html

imri commented 7 years ago

OK, I will check it soon.


imri commented 7 years ago

Fixed the bug. I checked using the Graph of the Gods, and it works :) I also updated the project to use Maven.

Let me know if it works for you.

ChaohsinChan commented 7 years ago

There was also an error; how can I resolve it? It seems to be a Guava version conflict.

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedTime(Ljava/util/concurrent/TimeUnit;)J
    at com.google.common.cache.LocalCache$LoadingValueReference.elapsedNanos(LocalCache.java:3600)
    at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2412)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2373)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2335)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2250)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3985)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4788)
    at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6$6.call(StandardTitanTx.java:1244)
    at com.thinkaurelius.titan.graphdb.query.QueryUtil.processIntersectingRetrievals(QueryUtil.java:268)
    at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6.execute(StandardTitanTx.java:1258)
    at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$6.execute(StandardTitanTx.java:1126)
    at com.thinkaurelius.titan.graphdb.query.QueryProcessor$LimitAdjustingIterator.getNewIterator(QueryProcessor.java:198)
    at com.thinkaurelius.titan.graphdb.query.LimitAdjustingIterator.hasNext(LimitAdjustingIterator.java:54)
    at com.thinkaurelius.titan.graphdb.query.ResultSetIterator.nextInternal(ResultSetIterator.java:40)
    at com.thinkaurelius.titan.graphdb.query.ResultSetIterator.<init>(ResultSetIterator.java:30)
    at com.thinkaurelius.titan.graphdb.query.QueryProcessor.iterator(QueryProcessor.java:57)
    at com.google.common.collect.Iterables$7.iterator(Iterables.java:613)
    at java.lang.Iterable.forEach(Iterable.java:74)
    at mizo.rdd.MizoRDD.loadRelationTypes(MizoRDD.java:149)
    at mizo.rdd.MizoRDD.<init>(MizoRDD.java:71)
    at mizo.rdd.MizoBuilder$1.<init>(MizoBuilder.java:53)
    at mizo.rdd.MizoBuilder.edgesRDD(MizoBuilder.java:53)
    at MizoEdgesCounter.main(MizoEdgesCounter.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

imri commented 7 years ago

This error is caused by a mismatch between the Guava versions used by Titan and the other components.

I succeeded in running the code with HBase 1.0.3. Try checking out the code into a new directory and running it from there, without any modifications; it should work.
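For reference, Guava conflicts of this kind are usually resolved by forcing a single Guava version across the build in the POM. A hedged sketch; 18.0 is assumed here as the version Titan 1.0 targets, so verify it against your own dependency tree (mvn dependency:tree) before relying on it:

```xml
<!-- Sketch: pin one Guava version for the whole build.
     The version is an assumption for Titan 1.0; confirm with
     `mvn dependency:tree`. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>18.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```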

ChaohsinChan commented 7 years ago

When I ran it without any modifications, there was an error here:

Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager
    at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:47)
    at com.thinkaurelius.titan.diskstorage.Backend.getImplementationClass(Backend.java:473)
    at com.thinkaurelius.titan.diskstorage.Backend.getStorageManager(Backend.java:407)
    at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1320)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:94)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:62)
    at mizo.rdd.MizoRDD.loadRelationTypes(MizoRDD.java:141)

imri commented 7 years ago

Pushed an update fixing this; try now. It is working for me.

ChaohsinChan commented 7 years ago

I got a result, but there was an error when the job completed:

27490 [main] INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: count at MizoEdgesCounter.java:34, took 2.037018 s
Edges count is: 34

27871 [DestroyJavaVM] WARN com.thinkaurelius.titan.graphdb.database.StandardTitanGraph - Unable to remove graph instance uniqueid c0a8adc387204-DE0018-PC1
com.thinkaurelius.titan.core.TitanException: Could not execute operation due to backend exception
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:44)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:144)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:141)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.set(KCVSConfiguration.java:118)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.remove(KCVSConfiguration.java:159)
    at com.thinkaurelius.titan.diskstorage.configuration.ModifiableConfiguration.remove(ModifiableConfiguration.java:42)
    at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.closeInternal(StandardTitanGraph.java:191)
    at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.access$600(StandardTitanGraph.java:78)
    at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph$ShutdownThread.start(StandardTitanGraph.java:803)
    at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:102)
    at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
    at java.lang.Shutdown.runHooks(Shutdown.java:123)
    at java.lang.Shutdown.sequence(Shutdown.java:167)
    at java.lang.Shutdown.shutdown(Shutdown.java:234)
Caused by: com.thinkaurelius.titan.diskstorage.PermanentBackendException: Permanent exception while executing backend operation setConfiguration
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:42)
    ... 13 more
Caused by: java.lang.IllegalArgumentException: Connection is null or closed.
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:310)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:712)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:694)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getTable(ConnectionManager.java:532)
    at com.thinkaurelius.titan.diskstorage.hbase.HConnection1_0.getTable(HConnection1_0.java:22)
    at com.thinkaurelius.titan.diskstorage.hbase.HBaseStoreManager.mutateMany(HBaseStoreManager.java:424)
    at com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.mutateMany(HBaseKeyColumnValueStore.java:189)
    at com.thinkaurelius.titan.diskstorage.hbase.HBaseKeyColumnValueStore.mutate(HBaseKeyColumnValueStore.java:88)
    at com.thinkaurelius.titan.diskstorage.locking.consistentkey.ExpectedValueCheckingStore.mutate(ExpectedValueCheckingStore.java:65)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:146)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$2.call(KCVSConfiguration.java:141)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:133)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation$1.call(BackendOperation.java:147)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:56)
    ... 14 more

imri commented 7 years ago

I will fix it soon. Did you succeed?

ChaohsinChan commented 7 years ago

Yes! Apart from the error above, I did get the results. It was not easy!

ChaohsinChan commented 7 years ago

I will soon traverse all the vertex information to check whether it is correct.

imri commented 7 years ago

Ok keep me updated :)

ChaohsinChan commented 7 years ago

How can I bulk-import data into Titan? Can you give me some advice? I have 100 GB of data. Thanks.

imri commented 7 years ago

Hey,

Create a new transaction that uses batches (TitanGraph.buildTransaction().enableBatchLoading().checkExternalVertexExistence(false)), then commit() the transaction every X insertions, for example every 50k.
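The commit-every-X pattern above can be factored into a small reusable helper. This is a generic sketch (the BatchCommitter class is my own illustration, not part of Mizo or Titan); in practice the flush action would be tx.commit() followed by opening a fresh batch-loading transaction:

```java
import java.util.function.Consumer;

/**
 * Runs a flush action every `batchSize` insertions, plus once at the
 * end for any remainder. With Titan, `flush` would commit() the current
 * batch-loading transaction and start a new one.
 */
class BatchCommitter<T> {
    private final int batchSize;
    private final Consumer<T> insert;
    private final Runnable flush;
    private long pending = 0;

    BatchCommitter(int batchSize, Consumer<T> insert, Runnable flush) {
        this.batchSize = batchSize;
        this.insert = insert;
        this.flush = flush;
    }

    void add(T item) {
        insert.accept(item);
        if (++pending % batchSize == 0) flush.run();
    }

    void finish() {
        if (pending % batchSize != 0) flush.run();
    }
}
```

With a batch size of 50,000 this keeps memory bounded, and checkExternalVertexExistence(false) skips the per-vertex existence check that normally slows down bulk loads.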

figo1885 commented 7 years ago

Hello imri,

Thank you for the great work on Mizo. I am facing the same problems described in these Stack Overflow questions:

Q1: http://stackoverflow.com/questions/41121262/reading-a-large-graph-from-titan-on-hbase-into-spark?rq=1
Q2: http://stackoverflow.com/questions/35464538/how-to-process-large-titan-graph-using-spark

Until now, I have not been able to find a good practice for doing OLAP on Titan with Spark. Have you tried using SparkGraphComputer directly for OLAP? Do you have any example code? In the TitanBlueprintsGraph.java file, the compute method is overridden as:

@Override
public <C extends GraphComputer> C compute(Class<C> graphComputerClass) throws IllegalArgumentException {
    if (!graphComputerClass.equals(FulgoraGraphComputer.class)) {
        throw Graph.Exceptions.graphDoesNotSupportProvidedGraphComputer(graphComputerClass);
    } else {
        return (C) compute();
    }
}

So I think that when I create a TitanGraph, it does not support SparkGraphComputer. I can only create a HadoopGraph via graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties'), but then how does it traverse the Titan graph DB? I cannot find how it scans the HBase tables. Do you have any example code for SparkGraphComputer working with Titan?

Thank you very much.

imri commented 7 years ago

Hey,

This answer might be helpful.

I have used SparkGraphComputer with Titan, but it malfunctions and is really buggy. In order for this to work, you have to use HadoopGraph (as specified in the answer above), which internally uses an InputFormat to read the graph. Titan's implementation of InputFormat was buggy: first of all, it skips vertices (if you count the number of vertices using the InputFormat, you get a wrong answer). Second, it crashes in some circumstances (for example, on an edge that connects a vertex to itself). Third, SparkGraphComputer is really, really slow; I haven't researched why. To sum up, as far as I'm concerned, SparkGraphComputer is bad.

What are you trying to achieve? Tell me more, maybe we can figure it out using Mizo.

Best regards

figo1885 commented 7 years ago

Thank you very much! I am so excited that you answered me. (Please ignore my English grammatical errors.) I am trying to use Titan to store some relational data about users: user follow relations and users' second-hand goods for sale. Then I want to do some OLAP analysis for relation recommendations, goods recommendations, user clustering, and so on. For example:

Case 1: A follows B, B follows C, so maybe A will be interested in C.
Case 2: I want to find why and how users follow one another, and whether there are any common features.

Now I have already built my Titan cluster using HBase + Elasticsearch as the backend for the OLTP service, and I am trying to build my OLAP environment based on Titan and Spark, but I found there is no good documentation. Titan does not even support Spark well.

When I found the Mizo project, I thought maybe I could do OLAP on Spark GraphX. I mean, I would only scan my Titan HBase table for all vertices and edges into Spark, and then use Spark GraphX to do the analysis. Is this possible?

Thank you again !

imri commented 7 years ago

So if I understand you correctly, you want to expand from a given vertex through multiple hops. Mizo only allows you to expand from a given vertex to its direct edges.

I haven't used GraphX, but as far as I'm concerned, it should be really easy to integrate Mizo with it, since GraphX only expects an RDD of edges; you can convert Mizo's edges RDD to an RDD of GraphX edges. I'm not sure what you'll be able to achieve using GraphX, but give it a try.

If you need any help, let me know.

figo1885 commented 7 years ago

Thank you, I will give it a try.

figo1885 commented 7 years ago

Hello imri,

I have started a Spark OLAP task based on Titan & HBase & Gremlin's SparkGraphComputer, but as in your experiments, it works very slowly: with 150 vertices in the graph it costs 4 minutes, and with 10 million vertices it takes far too long. It seems to get stuck reading the RDD from Titan. (screenshot omitted)

My HBase version is 0.94, but I found that Mizo depends on the 1.0.2 HBase client, and my HBase production environment does not allow me to read HFiles directly...

I am trying to solve these problems.

PS: I have a question about using Titan: is there any way to create the property key first, commit, and then do the indexing later? When I write properties without creating an index (using Elasticsearch), I get errors.

figo1885 commented 7 years ago

Hello,

I have successfully run the edge and vertex count test cases using Mizo! Thank you. I am using HBase 0.98, Spark 1.5.1, and Titan's Graph of the Gods. I still have some questions. The counts do not look right: there are 17 edges, but the Mizo edge count result is 32, which is not 17*2. Then I built a very simple graph with only 3 vertices, and after my test with Mizo it found a vertex count of 10; there are 7 unrelated vertices, which I think may be index or internal-use vertices in Titan. I think this may be related to the 'Multiple Item Data Model' (ref: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html), because when I scan the table with the HBase shell, the same rowkey appears with multiple values. (screenshot omitted)

  1. In the MizoRDD.java file, when loading relation types, why are the labels configured for vertices ignored? If I need the vertex label info, it is impossible to get.
    protected static HashMap<Long, MizoTitanRelationType> loadRelationTypes(String titanConfigPath)
    {
    ...
                .forEach(v -> {
                    if (v instanceof InternalRelationType)
                        relations.put(v.longId(), new MizoTitanRelationType((InternalRelationType)v));
                });
    }
  2. When I use HBase v0.98, in MizoRegionFamilyCellsIterator.java the ASC_CELL_COMPARATOR references CellComparator.compareRows and compareTimestamps methods that do not exist in this version, so I changed them to compareStatic, as in the following code.
    private Comparator<Cell> ASC_CELL_COMPARATOR = (left, right) -> {
        int c = CellComparator.compareStatic(left, right);
        if (c != 0) {
            return c;
        } else {
            if (left.getFamilyLength() + left.getQualifierLength() == 0 &&
                    left.getTypeByte() == KeyValue.Type.Minimum.getCode()) {
                return 1;
            } else if (right.getFamilyLength() + right.getQualifierLength() == 0 &&
                    right.getTypeByte() == KeyValue.Type.Minimum.getCode()) {
                return -1;
            } else {
                boolean sameFamilySize = left.getFamilyLength() == right.getFamilyLength();
                if (!sameFamilySize) {
                    return Bytes.compareTo(left.getFamilyArray(), left.getFamilyOffset(), left.getFamilyLength(),
                            right.getFamilyArray(), right.getFamilyOffset(), right.getFamilyLength());
                } else {
                    int diff = CellComparator.compareStatic(left, right);
                    if (diff != 0) {
                        return diff;
                    } else {
                        c = Longs.compare(right.getTimestamp(), left.getTimestamp());
                        if (c != 0) diff=c;
                        //diff = CellComparator.compareTimestamps(right, left); // Different from CellComparator.compare()
                        return diff != 0 ? diff : (255 & right.getTypeByte()) - (255 & left.getTypeByte());
                    }
                }
            }
        }
    };

    I do not quite understand this part. Why does it need to create an ascending-sorted cells iterator? What does a Cell mean here: is it a property or an edge within one row? Is there any suggested documentation for me to understand HTable, region, column family, cell, etc.? Is there any suggested documentation for me to understand the Titan data model?