Open basavaraj29 opened 1 year ago
This is great. Do we have any numbers on how much this improves pre-processing?
On the Freebase86m dataset, the Spark preprocessor previously took ~70 minutes; it now takes ~10 minutes. The ID assignment for the edges (which we don't really need) was the major bottleneck. Also, we were previously triggering a repartition(1) on both the nodes and relations DataFrames; we have replaced that with Spark's library function zipWithIndex.
Excellent work!