vesoft-inc / nebula-algorithm

Nebula-Algorithm is a Spark application based on GraphX that enables state-of-the-art graph algorithms to run on top of NebulaGraph and writes the results back to NebulaGraph.

Sample for nebula-ngql data source #72

Closed · porscheme closed this issue 1 year ago

porscheme commented 1 year ago

General Question

Hi @wey-gu

Per the comments in the application.conf file, the data source can be nebula-ngql. Can you please provide a sample? I want to try this feature.

Thanks

Below is an extract from the application.conf file

data: {
    # data source. options: nebula, nebula-ngql, csv, json
    source: csv
    # data sink: the algorithm result will be written into this sink. options: nebula, csv, text
    sink: csv
    # whether your algorithm needs weight
    hasWeight: false
  }
wey-gu commented 1 year ago

It should be like this. @Nicole00, could you help confirm this will work? If so, I can prepare a PR for examples in the conf file.

data: {
    # data source. options: nebula, nebula-ngql, csv, json
    source: nebula-ngql
...
  nebula: {
    read: {
        metaAddress: "127.0.0.1:9559"
        graphAddress: "127.0.0.1:9669"
        space: basketballplayer
        labels: ["follow"]
        weightCols: ["degree"]
        ngql: "MATCH ()-[e:follow]->() RETURN e LIMIT 100000"
    }
  }
porscheme commented 1 year ago

Thanks @wey-gu for the quick reply. It looks like nebula-algorithm doesn't work with string VIDs; can you confirm? I also see the note below, so how do I convert our string VIDs to integers using the algorithm interface?

For non-integer String data, it is recommended to use the algorithm interface. You can use the dense_rank function of SparkSQL to encode the data as the Long type instead of the String type.
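
For reference, a minimal sketch of that dense_rank approach in Spark (Scala). The edges DataFrame and its src/dst column names are assumptions for illustration, not part of the library:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// `edges` is assumed to be a DataFrame with string columns "src" and "dst".
def encodeEdges(edges: DataFrame): (DataFrame, DataFrame) = {
  // Collect every distinct string VID appearing at either end of an edge.
  val vertices = edges.select(col("src").as("id"))
    .union(edges.select(col("dst").as("id")))
    .distinct()
  // dense_rank over the ordered ids gives each VID a compact Long code.
  // (A global Window like this pulls everything into one partition; fine
  // for a sketch, but consider monotonically_increasing_id at scale.)
  val mapping = vertices.withColumn(
    "encodedId",
    dense_rank().over(Window.orderBy(col("id"))).cast("long"))
  // Rewrite the edge list in terms of the encoded Long ids.
  val encoded = edges
    .join(mapping.withColumnRenamed("id", "src")
                 .withColumnRenamed("encodedId", "srcId"), "src")
    .join(mapping.withColumnRenamed("id", "dst")
                 .withColumnRenamed("encodedId", "dstId"), "dst")
    .select(col("srcId").as("src"), col("dstId").as("dst"))
  (encoded, mapping)
}

The returned mapping DataFrame can be joined back onto the algorithm output afterwards to translate the encoded Long ids back to the original string VIDs.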
wey-gu commented 1 year ago

Actually, it now supports numerical VID generation and auto-mapping; just add encodeId: true to the algo config, see https://github.com/vesoft-inc/nebula-algorithm/pull/68
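For an algorithm where encodeId is supported (pagerank, for example), the algorithm section of the conf would look roughly like this sketch; the parameter values shown are the usual illustrative defaults, not confirmed settings:

  algorithm: {
    executeAlgo: pagerank
    pagerank: {
        maxIter: 10
        resetProb: 0.15
        encodeId: true
    }
  }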

porscheme commented 1 year ago

You mean like below?

  algorithm: {
    executeAlgo: node2vec
    node2vec: {
       encodeId: true,
       maxIter: 5,
       lr: 0.025,
       dataNumPartition: 15,
       modelNumPartition: 10,
       dim: 9,
       window: 2,
       walkLength: 4,
       numWalks: 10,
       p: 0.5,
       q: 0.5,
       directed: false,
       degree: 2,
       embSeparate: ",",
       modelPath: "/mnt/data/sparkdata/word2vec"
    }
  }
wey-gu commented 1 year ago

You mean like below?

Yes

porscheme commented 1 year ago

Yes

I'm getting this error and am not sure why. In the log below, "0033af94-95f2-ec6d-ac72-f75f4d00622a" is a VID.

{"level":"WARN","timestamp":"2023-03-22 04:43:17,806","thread":"main","message":"The jar local:///mnt/spark/work/nebula-algorithm-3.0-SNAPSHOT.jar has been added already. Overwriting of added jars is not supported in the current version."}
{"level":"WARN","timestamp":"2023-03-22 04:43:18,145","thread":"main","message":"returnCols is empty and your result will contain all properties for HAS_CONDITION"}
{"level":"WARN","timestamp":"2023-03-22 04:43:20,948","thread":"Executor task launch worker for task 0","message":"Putting block rdd_6_0 failed due to exception java.lang.NumberFormatException: For input string: "0033af94-95f2-ec6d-ac72-f75f4d00622a"."}
{"level":"WARN","timestamp":"2023-03-22 04:43:20,949","thread":"Executor task launch worker for task 0","message":"Block rdd_6_0 could not be removed as it was not found on disk or in memory"}
{"level":"ERROR","timestamp":"2023-03-22 04:43:20,959","thread":"Executor task launch worker for task 0","message":"Exception in task 0.0 in stage 0.0 (TID 0)"}
java.lang.NumberFormatException: For input string: "0033af94-95f2-ec6d-ac72-f75f4d00622a"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
    at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
    at com.vesoft.nebula.algorithm.utils.NebulaUtil$$anonfun$1.apply(NebulaUtil.scala:29)
    at com.vesoft.nebula.algorithm.utils.NebulaUtil$$anonfun$1.apply(NebulaUtil.scala:25)
    at org.apache.spark.sql.execution.MapElementsExec$$anonfun$7$$anonfun$apply$1.apply(objects.scala:236)
    at org.apache.spark.sql.execution.MapElementsExec$$anonfun$7$$anonfun$apply$1.apply(objects.scala:236)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
    at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
    at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
wey-gu commented 1 year ago

@Nicole00 I think encodeId: true is supported for the main entry of nebula-algorithm, or is it actually not?

wey-gu commented 1 year ago

And @porscheme you are using the latest version of nebula-algo, right?

porscheme commented 1 year ago

And @porscheme you are using the latest version of nebula-algo, right?

I cloned https://github.com/vesoft-inc/nebula-algorithm a few hours ago, so I'm using the latest.

wey-gu commented 1 year ago

Oh, now I see: node2vec does not yet support encodeId, so for now you have to map the VIDs to integers yourself.
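
For the manual route, a rough sketch continuing the encodeEdges example above (Scala). The call into the algorithm library is left as a comment because the exact class and signature are not confirmed here, and the result's id column name is passed in rather than assumed:

import org.apache.spark.sql.DataFrame

// `encoded` and `mapping` come from the encodeEdges sketch above.
// Run node2vec on `encoded` through the library's algorithm interface
// (hypothetical call; verify the actual API in com.vesoft.nebula.algorithm.lib):
// val result: DataFrame = Node2vecAlgo(spark, encoded, node2vecConfig, hasWeight = false)

// Join the algorithm output back onto the mapping to restore string VIDs;
// `idCol` names whichever result column holds the encoded vertex id.
def decodeResult(result: DataFrame, mapping: DataFrame, idCol: String): DataFrame =
  result.join(mapping, result(idCol) === mapping("encodedId"))
    .withColumnRenamed("id", "originalVid")
    .drop("encodedId")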

QingZ11 commented 1 year ago

@porscheme Hi, as with the previous issue you created, this issue has been closed due to a lack of updates for a long time. If you have any updates, feel free to reopen it.

Again, thanks a lot for your contribution 😊