ldbc / ldbc_graphalytics_docs

Specification of the LDBC Graphalytics benchmark
https://ldbcouncil.org/ldbc_graphalytics_docs/graphalytics_spec.pdf
Apache License 2.0

Offer data generators #18

Closed · Kixiron closed this issue 2 years ago

Kixiron commented 2 years ago

Most other LDBC benchmarks offer data generator implementations that can be used or referenced, but Graphalytics currently doesn't offer anything (that I can find). The data sets involved in Graphalytics are very large and can be incredibly slow to download, so being able to generate them would be very nice.

szarnyasg commented 2 years ago

Hi @Kixiron, there are some conversion scripts available at https://github.com/ldbc/ldbc_graphalytics/tree/master/bin/utils/graph-format-conversion/graph-specific. However, they do not have any documentation, so getting them running would require some trial and error.

My recommendation would be to stick with the data sets available at https://github.com/ldbc/data-sets-surf-repository. These can indeed be slow to download, but I could always get ~10 MB/s, so even the largest file (graph500-30, 100 GB) should download in under 3 hours. Did you encounter slower speeds?

Kixiron commented 2 years ago

Well, that's pretty much the issue: it's hard to ask new devs or CI to take 2+ hours to download a benchmark.

szarnyasg commented 2 years ago

Understood – but unfortunately, I don't see a quick fix for that.

Approach 1: Regenerating the data

Let's suppose we can get the data generation pipeline working such that it produces exactly the same graph500, datagen, etc. graphs as the ones available for this benchmark. It would still be slow due to limitations in the generator and the processing scripts. For example, the datagen graphs have to be generated with the old Hadoop-based Datagen, which is single-threaded on a single machine (Hadoop's pseudo-distributed mode) and only parallelizes when run on an actual Hadoop cluster. The graphs are then post-processed by a shell script using standard UNIX tools (sort, cat, cut), which sometimes dumps intermediate data sets to disk. Overall, re-generating the data sets wouldn't be much faster than downloading them at the current (slow) rate.
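
To give a feel for that post-processing step, here is a minimal sketch of the conversion in TypeScript/Node.js. It assumes a `|`-separated, Datagen-style edge file and the space-separated Graphalytics `.e` output format; the file names are hypothetical, and unlike the real scripts it sorts in memory rather than out-of-core:

```ts
import { createReadStream, writeFileSync } from "node:fs";
import { createInterface } from "node:readline";

// Convert a '|'-separated edge file into a Graphalytics-style .e file:
// keep the first two columns, sort numerically, and deduplicate.
async function convertEdges(inputCsv: string, outputE: string): Promise<void> {
  const edges: Array<[number, number]> = [];
  const reader = createInterface({ input: createReadStream(inputCsv) });
  for await (const line of reader) {
    // Equivalent of `cut -d'|' -f1,2`; non-numeric rows (e.g. the header) are skipped.
    const [src, dst] = line.split("|");
    const s = Number(src);
    const t = Number(dst);
    if (Number.isNaN(s) || Number.isNaN(t)) continue;
    edges.push([s, t]);
  }
  // Equivalent of `sort -n -k1,1 -k2,2`.
  edges.sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  // Deduplicate consecutive rows and emit "source target" lines.
  const rows: string[] = [];
  let prev = "";
  for (const [s, t] of edges) {
    const row = `${s} ${t}`;
    if (row !== prev) rows.push(row);
    prev = row;
  }
  writeFileSync(outputE, rows.join("\n") + "\n");
}

// Hypothetical file names, for illustration only.
await convertEdges("person_knows_person_0_0.csv", "example-directed.e");
```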

Approach 2: Moving to a faster hosting service

Regarding faster hosting, I see two potential solutions:

  1. Upload the data sets to the object storage services of the three major cloud providers and make them available with the "requester pays" (or equivalent) option (a download sketch follows this list). This is quite expensive for LDBC: napkin math shows ~1 USD / GB / year, and we have tens of terabytes of data for SNB and Graphalytics (some of these are currently being generated and not yet available). Also, these cloud deployments are not going to benefit users with on-prem deployments; they will need to use the SURF service (which is also a paid one, although it is subsidized).

  2. Use an egress-free fast object storage service. Cloudflare R2 seems promising and is reasonably priced. It's still in beta and does not support public buckets out of the box; getting public buckets requires a (seemingly simple) workaround of running a Cloudflare Worker (a Worker sketch follows this list). Unfortunately, users in the Reddit thread on R2 bandwidth report 200-300 megabits/s (25-37.5 MB/s), which still isn't great and is 5-10x off from what e.g. S3 can do within AWS.
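
To illustrate option 1 from the user's side: with "requester pays", the downloader's own cloud account is billed for the egress. A minimal sketch with the AWS SDK for JavaScript v3 (the bucket and key names are hypothetical):

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import type { Readable } from "node:stream";

const s3 = new S3Client({ region: "us-east-1" });

// RequestPayer: "requester" declares that the caller accepts the transfer
// charges; without it, requests against a requester-pays bucket are rejected.
const { Body } = await s3.send(
  new GetObjectCommand({
    Bucket: "ldbc-datasets",                 // hypothetical bucket name
    Key: "graphalytics/graph500-30.tar.zst", // hypothetical object key
    RequestPayer: "requester",
  })
);

await pipeline(Body as Readable, createWriteStream("graph500-30.tar.zst"));
```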
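And for option 2, the public-bucket workaround follows Cloudflare's documented pattern of putting a small Worker in front of the private bucket. A minimal sketch, assuming an R2 bucket bound to the Worker as `DATASETS` in `wrangler.toml` (all names illustrative):

```ts
export interface Env {
  DATASETS: R2Bucket; // R2 binding configured in wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Map the URL path (e.g. /graphalytics/graph500-30.tar.zst) to an object key.
    const key = new URL(request.url).pathname.slice(1);
    const object = await env.DATASETS.get(key);
    if (object === null) {
      return new Response("Object Not Found", { status: 404 });
    }
    // Propagate the object's stored HTTP metadata (content type etc.) and ETag.
    const headers = new Headers();
    object.writeHttpMetadata(headers);
    headers.set("etag", object.httpEtag);
    return new Response(object.body, { headers });
  },
};
```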

If you are aware of an alternative service for data hosting, please let me know.

szarnyasg commented 2 years ago

I deployed the data sets on Cloudflare R2 (URLs may change in the future). This does the job well.