ldbc / ldbc_snb_datagen_spark

Synthetic graph generator for the LDBC Social Network Benchmark, running on Spark
https://ldbcouncil.org/benchmarks/snb
Apache License 2.0

Parameter generation is very slow #51

Closed: ArnauPrat closed this issue 4 years ago

ArnauPrat commented 6 years ago

According to reports from several users, for large datasets (e.g. SF1000) the generation of parameters becomes the most expensive part of the generation process (90 minutes to generate the data, 12 hours to generate the parameters). We should rethink its implementation (maybe port it to a Hadoop job).

mingxiw commented 5 years ago

We at TigerGraph tried the data generation for SF-1000. It took 33+ hours. However, the problem is we could not find the parameter files under ldbc_snb_data/substitution_parameters. Has anyone successfully generated the parameters for SF-1000?

ArnauPrat commented 5 years ago

Can you look at the parameter generation log files (parameters_bi.log and parameters_interactive.log)? Any hints there?

CongyanLi01 commented 5 years ago

Continued from the comment from TigerGraph above. Here is what the two logs show.

From `parameters_bi.log`:

```
loading input for parameter generation
Traceback (most recent call last):
  File "paramgenerator/generateparamsbi.py", line 410, in <module>
    sys.exit(main())
  File "paramgenerator/generateparamsbi.py", line 330, in main
    readfactors.load(personFactorFiles, activityFactorFiles, friendsFiles)
  File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load
    for line in f.readlines():
  File "/usr/lib/python2.7/codecs.py", line 696, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 606, in readlines
    return data.splitlines(keepends)
MemoryError
```

and for "parameters_interactive.log": loading input for parameter generation Traceback (most recent call last): File "paramgenerator/generateparams.py", line 258, in sys.exit(main()) File "paramgenerator/generateparams.py", line 133, in main (personFactors, countryFactors, tagFactors, tagClassFactors, nameFactors, givenNames, ts, postHisto) = readfactors.load(personFactorFiles, activityFactorFiles, friendsFiles) File "/home/ubuntu/datagen/ldbc_snb_datagen/paramgenerator/readfactors.py", line 72, in load for line in f.readlines(): File "/usr/lib/python2.7/codecs.py", line 696, in readlines return self.reader.readlines(sizehint) File "/usr/lib/python2.7/codecs.py", line 606, in readlines return data.splitlines(keepends) MemoryError

It seems that there is something wrong with the memory. I added `export HADOOP_CLIENT_OPTS="-Xmx200G"` to run.sh, and my machine has 244 GB of memory. Do you have any suggestions?

ArnauPrat commented 5 years ago

Parameter generation is implemented as a couple of Python scripts; this is the reason it is so slow, since their execution is not parallelized in any way. Setting HADOOP_CLIENT_OPTS only affects the JVM, so it has no effect on parameter generation. The scripts under the "paramgenerator" folder take the "factor" files as input. These factor files, namely mXactivityFactors.txt, mXfriendList0.csv and mXpersonFactors.txt (where X is a number between 0 and NumberOfWorkers-1), are produced by Datagen during data generation and can be found under the /hadoop folder (in the local filesystem if you executed in standalone mode, or in HDFS if you executed in distributed or pseudo-distributed mode). If you can get these files, you can debug just the parameter generation part without having to rerun the whole generation process.
Here is where the script is launched; the first parameter to the script is the location of the factor files.
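For illustration, a minimal sketch of rerunning only the parameter generation against factor files copied out of the datagen output (the `./factors` layout and the single-argument invocation are assumptions; check how run.sh launches the script for the exact argument list):

```python
# Hypothetical sketch: rerun only the parameter generation step, assuming
# the factor files (m0personFactors.txt, m0activityFactors.txt,
# m0friendList0.csv, ...) were copied from the /hadoop output into ./factors.
import subprocess

factor_dir = "./factors"  # assumed location of the factor files

# The first argument is the factor-file directory; any further arguments
# are elided here -- see run.sh for the full invocation.
subprocess.check_call(["python", "paramgenerator/generateparams.py", factor_dir])
```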

szarnyasg commented 4 years ago

Closing this, but the story continues in ~#206~ #83.