We at TigerGraph tried this data generation for SF-1000. It took 33+ hours. However, we could not find the parameter files under ldbc_snb_data/substitution_parameters. Has anyone successfully generated the parameters for SF-1000?
Can you look at the parameter generation log files (parameters_bi.log and parameters_interactive.log)? Any hints there?
Continuing from the TigerGraph comment above:
Here are the relevant excerpts from the two logs:
For "parameters_bi.log":
loading input for parameter generation
Traceback (most recent call last):
File "paramgenerator/generateparamsbi.py", line 410, in
and for "parameters_interactive.log":
loading input for parameter generation
Traceback (most recent call last):
File "paramgenerator/generateparams.py", line 258, in
It seems that something is wrong with the memory. I added 'export HADOOP_CLIENT_OPTS="-Xmx200G"' to run.sh, and my machine has 244 GB of memory. Do you have any suggestions?
Parameter generation is implemented as a couple of Python scripts; it is slow because its execution is not parallelized in any way. Setting HADOOP_CLIENT_OPTS will have no effect on parameter generation.
The parameter generation scripts under the paramgenerator folder take as input the "factor" files produced by Datagen. These factor files, namely mXactivityFactors.txt, mXfriendList0.csv and mXpersonFactors.txt (where X is any number between 0 and NumberOfWorkers-1), are written during data generation and can be found under the /hadoop folder (in the local filesystem if you executed in standalone mode, or in HDFS if you executed in distributed or pseudo-distributed mode).
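If Datagen ran in (pseudo-)distributed mode, you first need to copy the factor files out of HDFS so the Python scripts can read them locally. A minimal sketch, assuming the default /hadoop temp directory (adjust the path to your configuration):

# Copy the factor files from HDFS to the current local directory.
# The /hadoop path is an assumption; use whatever temp dir Datagen was configured with.
hdfs dfs -get '/hadoop/m*personFactors.txt' .
hdfs dfs -get '/hadoop/m*activityFactors.txt' .
hdfs dfs -get '/hadoop/m*friendList*.csv' .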
If you can get these files, you can debug just the parameter generation part without having to rerun the whole generation process.
Here is where the script is launched; the first parameter passed to the script is the location of the factor files.
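To rerun only the parameter generation step by hand, something like the following should work. This is a sketch, not the authoritative invocation: the argument order and paths are assumptions based on the description above, so check run.sh in your checkout for the exact arguments.

# Hypothetical manual invocation; verify against run.sh before using.
FACTOR_DIR=./factors                 # directory holding the m* factor files fetched above
OUT_DIR=./substitution_parameters    # where the substitution parameter files should appear
mkdir -p "$OUT_DIR"
python paramgenerator/generateparams.py   "$FACTOR_DIR" "$OUT_DIR" > parameters_interactive.log 2>&1
python paramgenerator/generateparamsbi.py "$FACTOR_DIR" "$OUT_DIR" > parameters_bi.log 2>&1

Redirecting stdout and stderr into the two log files mirrors the parameters_interactive.log and parameters_bi.log files discussed above, so any traceback lands in the same place you would look after a full run.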
Closing this, but the story continues in ~#206~ #83.
According to reports from several users, for large datasets (e.g. SF1000) parameter generation becomes the most expensive part of the generation process (90 minutes to generate the data, 12 hours to generate the parameters). We should rethink its implementation (maybe port it to a Hadoop job).