ldbc / ldbc_finbench_datagen

Apache License 2.0

Support ScaleFactor 100 and 1000 #93

Open ChrizZz110 opened 3 weeks ago

ChrizZz110 commented 3 weeks ago

Hi, thanks for working on this data generator. We are using the generated FinBench datasets for our research and would kindly ask you to support larger scale factors than the currently supported SF10. Especially for systems focusing on large-scale graphs, this would be a great extension.

qishipengqsp commented 3 weeks ago

Hi, thanks for the feedback. I have already been working on this scalability extension. v0.1.0 can only generate up to SF10. I have since extended it to SF30 and am working on SF100 now.

I am not very familiar with Spark application optimization, but the good news is that I am moving forward step by step. Hopefully it will support SF300 in the next few weeks.

qishipengqsp commented 3 weeks ago

Collaboration is welcome if you are a Spark expert. :)

ChrizZz110 commented 2 weeks ago

Thanks for the reply @qishipengqsp, happy to hear that larger SFs are already a work in progress.

Unfortunately, I'm not a Spark expert, but I do have some experience with Flink, if that helps. If you want, I can have a look. Is this the branch you are currently working on: https://github.com/ldbc/ldbc_finbench_datagen/tree/sf100 ?

qishipengqsp commented 3 days ago

@ChrizZz110 Thanks for your help, and apologies for the late response. I just came back from the 18th LDBC TUC and am catching up on the things I left behind.

Yes, I am working on that branch, but it is not much different from the main branch. I just created it to control the generation parameters for SF100.

Currently, I am stuck on this error:

Exception in thread "main" java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2477)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911)
    at ldbc.finbench.datagen.generation.generators.ActivityGenerator.signInEvent(ActivityGenerator.scala:134)
    at ldbc.finbench.datagen.generation.ActivitySimulator.simulate(ActivitySimulator.scala:79)
    at ldbc.finbench.datagen.generation.GenerationStage$.$anonfun$run$1(GenerationStage.scala:59)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at ldbc.finbench.datagen.util.SparkUI$.job(SparkUI.scala:12)
    at ldbc.finbench.datagen.generation.GenerationStage$.run(GenerationStage.scala:55)
    at ldbc.finbench.datagen.LdbcDatagen$.run(LdbcDatagen.scala:131)
    at ldbc.finbench.datagen.LdbcDatagen$.main(LdbcDatagen.scala:120)
    at ldbc.finbench.datagen.LdbcDatagen.main(LdbcDatagen.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
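Without seeing the actual code in `ActivityGenerator.signInEvent`, this is only a guess, but an `OutOfMemoryError` inside `ClosureCleaner$.ensureSerializable` usually means the closure passed to `mapPartitionsWithIndex` captures a large driver-side data structure, which Spark then tries to serialize in full before shipping tasks. One common fix is to move such state into a broadcast variable so the serialized closure only carries a small handle. A minimal sketch (all names here are hypothetical, not taken from the datagen code):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    // A large lookup table built on the driver. Capturing this directly in a
    // closure makes Spark serialize it with the closure itself, which is what
    // overflows ByteArrayOutputStream in ClosureCleaner.ensureSerializable.
    val hugeLookup: Map[Long, String] =
      (0L until 1000000L).map(i => i -> s"value-$i").toMap

    // Broadcast it once instead; tasks reference only the broadcast handle,
    // and each executor fetches the data a single time.
    val lookupBc = sc.broadcast(hugeLookup)

    val matched = sc.parallelize(0L until 100L)
      .mapPartitionsWithIndex { (partitionIdx, ids) =>
        // Resolved on the executor, not captured in the serialized closure.
        val lookup = lookupBc.value
        ids.map(id => (partitionIdx, lookup.getOrElse(id, "missing")))
      }
      .count()

    println(s"processed $matched records")
    spark.stop()
  }
}
```

If the closure is already lean, the same symptom can also come from an undersized driver heap, so raising `spark.driver.memory` on `spark-submit` may be worth trying as well.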