ldbc-dev / ldbc_snb_datagen_deprecated2015

LDBC-SNB Data Generator
GNU General Public License v3.0

Data doesn't get serialized from HDFS #19

Closed · MarcusParadies closed this issue 10 years ago

MarcusParadies commented 10 years ago

I have cloned and compiled the latest version of the data generator (based on commit 3fbdb0e6ea8be82d20a6ff5be521a7e8d28eafd4).

The jobs complete successfully, but no output is serialized to the specified directory.

Here is my params.ini file:

```
scaleFactor:1
serializer:csv
compressed:false
updateStreams:false
outputDir:/home/myuser/ldbc
numThreads:6
```
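I launch the generator with the provided run.sh (full build output is pasted further down); I'm assuming it picks up params.ini from the repository root:

```bash
cd ~/git/ldbc_snb_datagen
# run.sh rebuilds the jar with Maven and then launches the Hadoop job
# (see the build output below); it is assumed here that it reads
# params.ini from the repository root
./run.sh
```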

I can see that the jobs produced some files on HDFS containing useful data, but the last step that serializes the block files into the final files does not work.

Any idea how to solve this?

Thanks, Marcus

ArnauPrat commented 10 years ago

Hi @MarcusParadies, which execution mode are you using: standalone or pseudo-distributed?

MarcusParadies commented 10 years ago

I'm using the pseudo-distributed mode.

ArnauPrat commented 10 years ago

Have you tried navigating through the directory structure of HDFS? What files are actually created? Which version of Hadoop are you using?

MarcusParadies commented 10 years ago

I'm using Hadoop 1.2.1.

The HDFS is located at /tmp/hdfs/data and contains a couple of subfolders.

The 'current' subfolder contains a couple of generated files containing the CSV headers(?), and in subdirXX there are block files containing the actual data.

So I'm assuming that the files have been generated successfully and written back to HDFS, but the serialization into the final CSV files does not seem to work (it also fails for other output formats such as TTL, by the way).
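For what it's worth, here is roughly how I've been poking around (paths are from my setup; if I understand the Hadoop 1.x layout correctly, the blk_* files under subdirXX are the DataNode's raw block storage, not something readable directly):

```bash
# raw DataNode block storage: opaque blk_* files, not readable as CSV
ls /tmp/hdfs/data/current/subdir0

# recursive listing of the logical HDFS namespace, where the generator's
# output should actually appear (-lsr is the Hadoop 1.x fs shell syntax)
hadoop fs -lsr /
```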

MarcusParadies commented 10 years ago

According to https://github.com/ldbc/ldbc_snb_datagen/wiki/Data-Output, a social_network directory should be generated on HDFS(?). It does not exist anywhere on my system.

As a side note: The generation of the substitution parameters did produce the expected text files.

ArnauPrat commented 10 years ago

Can you paste the output of DATAGEN, please?

MarcusParadies commented 10 years ago

```
marcus@lu285378:~/git/ldbc_snb_datagen$ ./run.sh
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building Unnamed - ldbc.socialnet.dbgen:ldbc_snb_datagen:jar:0.1
[INFO]    task-segment: [clean]
[INFO] ------------------------------------------------------------------------
[INFO] [clean:clean {execution: default-clean}]
[INFO] Deleting file set: /home/marcus/git/ldbc_snb_datagen/target (included: [**], excluded: [])
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 second
[INFO] Finished at: Tue Oct 28 11:39:21 CET 2014
[INFO] Final Memory: 13M/724M
[INFO] ------------------------------------------------------------------------
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building Unnamed - ldbc.socialnet.dbgen:ldbc_snb_datagen:jar:0.1
[INFO]    task-segment: assembly:assembly
[INFO] ------------------------------------------------------------------------
[INFO] Preparing assembly:assembly
[INFO] ------------------------------------------------------------------------
[INFO] Building Unnamed - ldbc.socialnet.dbgen:ldbc_snb_datagen:jar:0.1
[INFO] ------------------------------------------------------------------------
[INFO] [resources:resources {execution: default-resources}]
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 255 resources
[INFO] [compiler:compile {execution: default-compile}]
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 102 source files to /home/marcus/git/ldbc_snb_datagen/target/classes
[WARNING] /home/marcus/git/ldbc_snb_datagen/src/main/java/ldbc/socialnet/dbgen/serializer/CSVSerializer/CSVSerializer.java: Some input files use unchecked or unsafe operations.
[WARNING] /home/marcus/git/ldbc_snb_datagen/src/main/java/ldbc/socialnet/dbgen/serializer/CSVSerializer/CSVSerializer.java: Recompile with -Xlint:unchecked for details.
[INFO] [resources:testResources {execution: default-testResources}]
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/marcus/git/ldbc_snb_datagen/src/test/resources
[INFO] [compiler:testCompile {execution: default-testCompile}]
[INFO] No sources to compile
[INFO] [surefire:test {execution: default-test}]
[INFO] No tests to run.
[INFO] [jar:jar {execution: default-jar}]
[INFO] Building jar: /home/marcus/git/ldbc_snb_datagen/target/ldbc_snb_datagen-0.1.jar
[INFO] [assembly:assembly {execution: default-cli}]
[INFO] Building jar: /home/marcus/git/ldbc_snb_datagen/target/ldbc_snb_datagen.jar
[WARNING] Configuration options: 'appendAssemblyId' is set to false, and 'classifier' is missing.
Instead of attaching the assembly file: /home/marcus/git/ldbc_snb_datagen/target/ldbc_snb_datagen.jar, it will become the file for main project artifact.
NOTE: If multiple descriptors or descriptor-formats are provided for this project, the value of this file will be non-deterministic!
[WARNING] Replacing pre-existing project main-artifact file: /home/marcus/git/ldbc_snb_datagen/target/ldbc_snb_datagen-0.1.jar with assembly file: /home/marcus/git/ldbc_snb_datagen/target/ldbc_snb_datagen.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 28 seconds
[INFO] Finished at: Tue Oct 28 11:39:50 CET 2014
[INFO] Final Memory: 65M/1762M
[INFO] ------------------------------------------------------------------------
Warning: $HADOOP_HOME is deprecated.

***** Configuration *****
scaleFactor: 1
numThreads: 6
serializer: csv
compressed: false
updateStreams: false
outputDir: /home/marcus/data/ldbc
numUpdatePartitions: 1

NUMBER OF THREADS 6
```


ArnauPrat commented 10 years ago

Have you checked what is inside HDFS, using the following command? `hadoop fs -ls /`

MarcusParadies commented 10 years ago

Aha! I think I know where the confusion comes from. I was expecting that the generated data files would be copied (similar to the substitution_parameters) to the directory specified by 'outputDir'. Apparently, this folder is not a local folder but an HDFS folder.

So the copy from HDFS to the local file system is not part of the generation process but has to be done by the user in a separate step.
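For anyone who hits the same confusion, something like this is the missing step (I'm assuming the social_network directory from the wiki ends up under the configured outputDir; the local target path is just an example):

```bash
# inspect what DATAGEN wrote under the HDFS output directory
hadoop fs -ls /home/marcus/data/ldbc/social_network

# copy the generated dataset from HDFS down to the local file system
hadoop fs -get /home/marcus/data/ldbc/social_network /home/marcus/ldbc_local
```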

Now that I read the documentation again, it makes sense. Maybe you could describe the data output generation in a bit more detail (https://github.com/ldbc/ldbc_snb_datagen/wiki/Data-Output).

So if you are ok with that, we can close this "issue". :-)

ArnauPrat commented 10 years ago

Hahaha, good news then :+1: We will try to improve that part of the documentation. Thanks!