Open basavaraj29 opened 1 year ago
This is great. Do we have any numbers on how much this improves pre-processing?
On the Freebase86m dataset, the Spark preprocessor previously took ~70 minutes; it now takes ~10 minutes. The ID assignment for the edges (which we don't really need) was the major bottleneck. Also, we were previously triggering a repartition(1) on both the nodes and relations DataFrames; we have replaced that with Spark's library function zipWithIndex.
Excellent work!