szarnyasg opened this issue 1 year ago
The time required to generate SF30,000 on AWS EMR with 20 i3.4xlarge instances is ~12¼ hours, split across two steps:

- Run LDBC SNB Datagen
- S3 dist cp

Running the factor generator in its current form is very slow; this needs further investigation.
I'm struggling to generate a dataset of SF10K using Spark.
So far, I have attempted to install Spark locally and run it with `--parallelism 8 --memory 96G`. However, after about 2 hours, I received a `java.lang.OutOfMemoryError: Java heap space` error. I then reduced the parallelism to 2, but after running for 12 hours, I received an `Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval` error. Currently, I am following Spark's suggestion and adjusting the heartbeat timeout before running it again. I am unsure whether I have misconfigured something else. Should we provide a tutorial on how to generate large datasets with Spark if the default configuration is not enough?
Here is the full version of my current command: `./tools/run.py --cores 64 --parallelism 2 --memory 96G --conf spark.network.timeout=120000 spark.executor.heartbeatInterval=10000 -- --format csv --scale-factor 10000 --mode bi --explode-edges --output-dir /largedata/sf10000`
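For reference, a hedged sketch of the timeout tuning mentioned above. It assumes that `run.py` forwards each `--conf KEY=VALUE` pair to Spark the way `spark-submit` does (one pair per `--conf` flag; check the script's help output to confirm), and the specific timeout values are only illustrative:

```bash
# Sketch only: raise both timeouts, keeping spark.executor.heartbeatInterval
# well below spark.network.timeout (a general Spark requirement).
./tools/run.py --cores 64 --parallelism 2 --memory 96G \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s \
  -- --format csv --scale-factor 10000 --mode bi --explode-edges \
  --output-dir /largedata/sf10000
```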
Hi Youren, you will need a larger setup: SF10,000 needs 4 i3.4xlarge instances, which have 122 GiB of memory each.
Also, the factor generation is currently a very expensive step. This is something we'll fix in the near future; until then, make sure you do not use `--generate-factors`.
@Yourens you should try to increase parallelism, not decrease it. It controls the number of partitions generated: more partitions mean smaller partitions, which makes it more likely that each partition fits into memory.
The theoretical limit for SF10K seems to be 2720, based on `numPersons / blockSize`.
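To make the arithmetic explicit, here is a minimal sketch of that calculation. Both numbers are assumptions for illustration: `BLOCK_SIZE=10000` is an assumed block size, and `NUM_PERSONS` is simply the person count implied by 2720 blocks, not an official SF10K figure; check the Datagen source for the authoritative values.

```bash
# Back-of-envelope: the partition ceiling equals ceil(numPersons / blockSize).
NUM_PERSONS=27200000   # assumed: implied by 2720 * BLOCK_SIZE, not an official figure
BLOCK_SIZE=10000       # assumed block size
echo $(( (NUM_PERSONS + BLOCK_SIZE - 1) / BLOCK_SIZE ))   # prints 2720
```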
I admit this name is unintuitive, but it follows the Spark naming.
We run this with 1000 partitions altogether without memory issues on 122 GiB machines. With 96 GB of memory you might want to increase this somewhat. But if you don't care about small files, run with 2720; that's likely to succeed.
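Putting the advice in this thread together, a hedged sketch of an SF10K invocation could look like the following. The partition count comes from the comment above, the memory and output path are carried over from the earlier command (adjust both to your hardware), and `--generate-factors` is intentionally left out per the note on factor generation.

```bash
# Sketch only: high parallelism so each partition fits into memory,
# and no --generate-factors until factor generation is optimized.
./tools/run.py \
  --cores 64 \
  --parallelism 2720 \
  --memory 96G \
  -- \
  --format csv \
  --scale-factor 10000 \
  --mode bi \
  --explode-edges \
  --output-dir /largedata/sf10000
```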
As remarked in #321, the default settings can cause some thrashing. This may still be true today (although the Datagen is much better optimized now).
The Python script should be adjusted to use more machines for large SFs (e.g. ~20 instances for SF30K).
The expected duration of the generation job should also be documented.