rajasekarv / vega

A new, arguably faster, implementation of Apache Spark from scratch in Rust
Apache License 2.0

remove serialization of duplicate data in dependencies along with task #110

Open · rajasekarv opened this issue 4 years ago

AmbitionXiang commented 3 years ago

Hi, I ran the 'transitive closure on a graph' sample, the classic Spark example (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkTC.scala), and found that the total number of serialized bytes grew too fast for the job to run to completion: two or three iterations are enough to exhaust my memory. The problem seems related to this issue. If I want to contribute a fix, what is the main difficulty, and could you please give me some hints?
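(For context, SparkTC repeatedly joins the closure with the edge set and unions the result back in, so the lineage branches on every iteration. Below is a minimal, self-contained Rust sketch, not vega's actual types, of why serializing each task's dependencies by value makes the payload grow geometrically in such a loop; `Rdd`, `serialized_size`, and the byte counts are all illustrative assumptions.)

```rust
// Illustration only: not vega's real types. Shows how embedding each
// task's dependencies by value makes the serialized payload grow with
// every iteration of a loop like transitive closure.

#[derive(Clone)]
enum Rdd {
    Source(Vec<(u32, u32)>),   // base edge set
    Union(Box<Rdd>, Box<Rdd>), // tc.union(...)
    Join(Box<Rdd>, Box<Rdd>),  // tc.join(edges)
}

// Stand-in for serialization: count the bytes a naive by-value
// encoding of the whole lineage would produce.
fn serialized_size(rdd: &Rdd) -> usize {
    match rdd {
        Rdd::Source(edges) => edges.len() * 8,
        Rdd::Union(a, b) | Rdd::Join(a, b) => 1 + serialized_size(a) + serialized_size(b),
    }
}

fn main() {
    let edges = Rdd::Source(vec![(1, 2), (2, 3), (3, 1)]);
    let mut tc = edges.clone();
    for i in 1..=6 {
        // tc = tc.union(tc.join(edges)): both branches embed `tc` by value,
        // so the lineage (and its serialized form) roughly doubles per step.
        tc = Rdd::Union(
            Box::new(tc.clone()),
            Box::new(Rdd::Join(Box::new(tc), Box::new(edges.clone()))),
        );
        println!("iteration {i}: ~{} bytes", serialized_size(&tc));
    }
}
```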

AmbitionXiang commented 3 years ago

Hi, I've finished it. Thanks.

rajasekarv commented 3 years ago

Hello @AmbitionXiang

Hope you are doing well. Thanks for trying it out and surfacing the issue. Yes, because data is duplicated during serialization, memory can be exhausted very quickly when the data flow branches out a lot. It is a long-pending issue, and since I have been busy with personal work, I never found time to address it. I plan to resume work on the project in about a month and will be maintaining it actively this time. If you have done some work, please raise a pull request and I will merge it after review. Thanks a lot for your support.
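(One plausible shape of the fix, sketched below with made-up types rather than vega's real serializer: cache each shared dependency's bytes by allocation identity, ship them once, and have subsequent tasks carry only a small ID. `SerCache` and `serialize_shared` are hypothetical names used purely for illustration.)

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Illustration only: not vega's real serializer. Shared dependency data
// is serialized once, cached by pointer identity, and later tasks embed
// a small ID instead of another full copy.

struct SerCache {
    next_id: u64,
    // Key: address of the shared allocation; value: (id, bytes).
    seen: HashMap<usize, (u64, Vec<u8>)>,
}

impl SerCache {
    fn new() -> Self {
        Self { next_id: 0, seen: HashMap::new() }
    }

    /// Returns (id, Some(bytes)) the first time a shared blob is seen,
    /// and (id, None) on every later reference to the same allocation.
    fn serialize_shared(&mut self, data: &Arc<Vec<u8>>) -> (u64, Option<Vec<u8>>) {
        let key = Arc::as_ptr(data) as usize;
        if let Some((id, _)) = self.seen.get(&key) {
            return (*id, None); // already shipped: reference by ID only
        }
        let id = self.next_id;
        self.next_id += 1;
        let bytes = data.as_ref().clone(); // stand-in for real encoding
        self.seen.insert(key, (id, bytes.clone()));
        (id, Some(bytes))
    }
}

fn main() {
    let shared_dep = Arc::new(vec![0u8; 1024]); // e.g. a closure's captured data
    let mut cache = SerCache::new();

    let (id_a, payload_a) = cache.serialize_shared(&shared_dep);
    let (id_b, payload_b) = cache.serialize_shared(&shared_dep);

    assert_eq!(id_a, id_b);
    assert!(payload_a.is_some() && payload_b.is_none());
    println!("second task ships only dep id {id_b}, not another 1 KiB copy");
}
```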