ldbc / ldbc_snb_datagen_spark

Synthetic graph generator for the LDBC Social Network Benchmark, running on Spark
https://ldbcouncil.org/benchmarks/snb
Apache License 2.0

Document parameters #313

Closed szarnyasg closed 2 years ago

szarnyasg commented 3 years ago

The wiki of the previous (mix of Hadoop/Spark) repository had partial documentation of the user-facing parameters:

## User-facing Parameters

Users of the LDBC data generator specify configuration by means of the `params.ini` file. 

The `params.ini` file contains the following options:
* `generator.mode`
  + default: `interactive`
  + options: `interactive` `bi` `graphalytics` `rawData`
  + description: the mode the datagen executes in.
* `generator.scaleFactor`
  + default: `1` 
  + options: `0.003` `0.1` `0.3` `1` `3` `10` `30` `100`  
  + description: determines the generated data size. Scale factors `0.003`, `0.1`, and `0.3` are used for testing. Note that the Graphalytics scale factor is set with a different parameter (see below).
* `serializer.format`
  + default: `CsvBasic`
  + options: `CsvBasic` `CsvMergeForeign` `CsvComposite` `CsvCompositeMergeForeign`
  + description: determines the data serialization format
* `generator.numThreads`
  + default: `1`
  + description: determines the number of threads Hadoop uses. TODO: is this different now that the generator uses Spark?
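
As an illustration of how the defaults and allowed values listed above fit together, here is a minimal Python sketch that overlays a configuration on those defaults and rejects unknown option values. It is not the generator's own loader: the `key:value` line syntax, the `#` comment convention, and the names `load_params`, `DEFAULTS`, and `OPTIONS` are assumptions made for this example.

```python
# Defaults and allowed values copied from the parameter list above.
DEFAULTS = {
    "generator.mode": "interactive",
    "generator.scaleFactor": "1",
    "serializer.format": "CsvBasic",
    "generator.numThreads": "1",
}
OPTIONS = {
    "generator.mode": {"interactive", "bi", "graphalytics", "rawData"},
    "generator.scaleFactor": {"0.003", "0.1", "0.3", "1", "3", "10", "30", "100"},
    "serializer.format": {
        "CsvBasic", "CsvMergeForeign", "CsvComposite", "CsvCompositeMergeForeign",
    },
}


def load_params(text):
    """Parse `key:value` lines (assumed syntax) and overlay them on the defaults."""
    params = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {raw!r}")
        params[key.strip()] = value.strip()

    for key, allowed in OPTIONS.items():
        # Graphalytics uses its own scale factor values, so skip that check there.
        if key == "generator.scaleFactor" and params["generator.mode"] == "graphalytics":
            continue
        if params[key] not in allowed:
            raise ValueError(f"{key}: {params[key]!r} is not one of {sorted(allowed)}")
    return params


# Hypothetical params.ini contents, for illustration only.
example = """
# small test run in BI mode
generator.mode:bi
generator.scaleFactor:0.1
"""
params = load_params(example)
print(params["generator.mode"], params["serializer.format"])
```

Options that are not set in the file fall back to the documented defaults, so the example above runs in `bi` mode with the `CsvBasic` serializer.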

### Interactive

`interactive` mode only parameters:
* `generator.mode.interactive.numUpdateStreams`
  + default: `1`
  + description: determines the number of update streams (consisting of inserts and deletes) for the Interactive workload.

### BI

`bi` mode only parameters:
* `generator.mode.bi.batches`
  + default: `month`
  + options: `day` `month` `quarter`
  + description: determines batch time granularity
* `generator.mode.bi.deleteType` TODO: this feature has been decided against
  + default: `simple`
  + options: `simple` `complex`
  + description: determines delete operation type included in the batches. 
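
To make "batch time granularity" concrete, the following Python sketch groups dates into batches for each of the three `generator.mode.bi.batches` options. It only illustrates the idea of bucketing by day, month, or quarter; the function name `batch_key` and the key format are assumptions, not the generator's actual batching code.

```python
from datetime import date


def batch_key(d, granularity="month"):
    """Return a label for the batch that date `d` falls into."""
    if granularity == "day":
        return d.isoformat()
    if granularity == "month":
        return f"{d.year}-{d.month:02d}"
    if granularity == "quarter":
        quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        return f"{d.year}-Q{quarter}"
    raise ValueError(f"unknown granularity: {granularity!r}")


d = date(2012, 5, 17)
for g in ("day", "month", "quarter"):
    print(g, batch_key(d, g))
```

A coarser granularity yields fewer, larger batches: every event in May 2012 shares one `month` batch, and that batch in turn falls into the second `quarter` batch of 2012.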

### Graphalytics

`graphalytics` mode only parameters:
* `generator.scaleFactor`
  + default: `graphalytics.1` 
  + options: `graphalytics.1` `graphalytics.3` `graphalytics.10` `graphalytics.30` `graphalytics.100` `graphalytics.300` `graphalytics.1000` `graphalytics.3000`  `graphalytics.10000`  `graphalytics.30000`  
  + description: determines the generated data size. Scale factors `0.003`, `0.1`, and `0.3` are used for testing.
* `generator.degreeDistribution`
  + default: `Facebook`
  + options: `Facebook` `Altmann` `Weibull` `Empirical` `Geo` `MoeZipf` `Zipf`
  + description: 
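
Because the three sections above each list parameters that apply to a single mode only, a configuration can be checked for parameters that belong to a mode other than the selected one. The Python sketch below shows one way to do that. The names `MODE_ONLY` and `misplaced_params` are hypothetical, and whether the generator ignores or rejects such parameters is not stated here.

```python
# Mode-only parameters, as listed in the three sections above.
# `generator.scaleFactor` is left out because it is shared across modes.
MODE_ONLY = {
    "interactive": {"generator.mode.interactive.numUpdateStreams"},
    "bi": {"generator.mode.bi.batches", "generator.mode.bi.deleteType"},
    "graphalytics": {"generator.degreeDistribution"},
}


def misplaced_params(params):
    """Return the mode-only keys in `params` that belong to a different mode."""
    mode = params.get("generator.mode", "interactive")  # documented default
    misplaced = []
    for owner, keys in MODE_ONLY.items():
        if owner != mode:
            misplaced.extend(sorted(k for k in keys if k in params))
    return misplaced


config = {
    "generator.mode": "bi",
    "generator.mode.bi.batches": "day",
    "generator.degreeDistribution": "Zipf",  # graphalytics-only
}
print(misplaced_params(config))
```

Here `generator.degreeDistribution` is reported, since it is documented as a `graphalytics`-only parameter while the configuration selects `bi` mode.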

## Internal Parameters

Internal parameters are divided into two categories: `generator`-related and `hadoop`-related.

### Generator
These are determined by `generator.scaleFactor` 
* `generator.numPersons`
  + default: `10000`
  + description: the number of persons to generate 
* `generator.startYear`
  + default: `2010`
  + description: the start year of the simulation
* `generator.numYears`
  + default: `3`
  + description: the number of years to simulate
* `generator.delta`
  + default: `10000`
  + description: the minimum time between two operations
* `generator.dateFormatter`
  + default: `StringDate`
  + options: `StringDate` `LongDate`
* `generator.StringDate.dateTimeFormat`
  + default: `yyyy-MM-dd'T'HH:mm:ss.SSS+00:00`
* `generator.StringDate.dateFormat`
  + default: `yyyy-MM-dd`
* `generator.knowsGenerator`
  + default: `Distance`
  + options: `Distance` `Bter` `Clustering` `Random` 
  + description:
* `generator.person.similarity`
  + default: `GeoDistance`
  + options: `GeoDistance` `Interests`
  + description:
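
The date patterns above use Java date-format syntax, which can be hard to read at a glance. The Python sketch below renders the same instant in the default `StringDate` formats. It assumes timestamps are in UTC, since the pattern ends in a literal `+00:00`, and it additionally assumes that `LongDate` means milliseconds since the Unix epoch, which the list above does not state. All function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_string_datetime(dt):
    """Mimic the pattern yyyy-MM-dd'T'HH:mm:ss.SSS+00:00 for a timezone-aware datetime."""
    dt = dt.astimezone(timezone.utc)
    millis = dt.microsecond // 1000  # SSS is milliseconds, not microseconds
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}+00:00"


def format_string_date(dt):
    """Mimic the pattern yyyy-MM-dd."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d")


def format_long_date(dt):
    """Assumed meaning of LongDate: whole milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Start of the default simulation period (generator.startYear = 2010).
start = datetime(2010, 1, 1, tzinfo=timezone.utc)
print(format_string_datetime(start))
print(format_string_date(start))
print(format_long_date(start))
```

With the default `generator.startYear` of `2010`, the first instant of the simulation renders as `2010-01-01T00:00:00.000+00:00` in the `StringDate` date-time format and as `2010-01-01` in the date format.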

### Hadoop
* `hadoop.serializer.compressed`
  + default: `false`
  + description:
* `hadoop.serializer.endlineSeparator`
  + default: `false`
  + description:
* `hadoop.serializer.socialNetworkDir`
  + default: `./social_network`
  + description: TODO: this might be a duplicate of `outputDir`
* `hadoop.serializer.hadoopDir`
  + default: `./hadoop`
  + description:
* `hadoop.serializer.outputDir`
  + default: `./social_network/`
  + description:

szarnyasg commented 2 years ago

The current README now documents the user-facing params.