rajasekarv / vega

A new, arguably faster, implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars 206 forks

Koalas-like implementation #35

Closed xmehaut closed 4 years ago

xmehaut commented 4 years ago

Hi, I read in your post (https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c) that you want to take inspiration from pandas for implementing dataframes (and their API). You could consider basing your implementation on Koalas (https://github.com/databricks/koalas). Regards

rajasekarv commented 4 years ago

I actually meant that point for the Rust DataFrame APIs. By extension, we will try to do the same for the Python version too. Koalas is awesome, but as far as I know it supports only Spark datatypes, just like the PySpark/Spark DataFrame APIs. We will develop along similar lines and, if possible, try to include generic types as pandas does. In Rust, this is relatively easy; for Python, it is too early to guarantee anything.
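To make the distinction concrete: a pandas column can hold arbitrary Python objects (dtype `object`), whereas Spark, and hence Koalas, restricts columns to Spark SQL datatypes. A minimal sketch of what "generic types like pandas" means (the `Point` class here is just an illustration):

```python
import pandas as pd

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# A pandas column of arbitrary user-defined objects; Spark/Koalas
# columns cannot hold such values without serializing them first.
df = pd.DataFrame({"p": [Point(1, 2), Point(3, 4)]})
print(df["p"].dtype)                                 # object
print(df["p"].apply(lambda p: p.x + p.y).tolist())   # [3, 7]
```

The same DataFrame in PySpark would require encoding `Point` into supported Spark datatypes (e.g. a struct of two numeric fields).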

xmehaut commented 4 years ago

Thanks for the answer. Does that mean that dataframes will be mutable too? If so, could something like Delta Lake's features be included in FastSpark? Last question: will there be an overall architecture diagram explaining FastSpark's processing? Regards


rajasekarv commented 4 years ago

Mutating the inner values and keeping them consistent across the whole system is not possible. However, I am hoping to make something like the following possible.

import math
import pandas as pd

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def euclid_dis(self, b):
        # Euclidean distance between this point and point b
        return math.sqrt((self.x - b.x) ** 2 + (self.y - b.y) ** 2)

my_list = [Point(1, 1), Point(1, 1)]
plane_pd = pd.DataFrame([[p.x, p.y, p] for p in my_list], columns=list('XYO'))
plane_pd["O"] = plane_pd["O"].apply(lambda p: p.euclid_dis(Point(0, 0)))

This is just a small example, but I hope you get the idea.

rajasekarv commented 4 years ago

The project is still too early in its life to comment on other features. With enough support from the community, we can slowly add them.

The RDD and DAG scheduler are more or less the same as Spark's, which have already been studied in detail by many. When I open-source the dataframes, I will explain them in detail, as they are slightly different from Spark's.
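For readers unfamiliar with the Spark model referenced above: transformations on an RDD are recorded lazily as a lineage graph, and only an action triggers actual computation. A toy sketch of that idea (class and method names here are illustrative, not this project's API):

```python
class RDD:
    """Toy RDD: transformations build a lazy lineage chain;
    only an action (collect) walks the chain and computes results."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only set on the source RDD
        self._parent = parent  # upstream RDD in the lineage
        self._fn = fn          # transformation applied to the parent's output

    def map(self, f):
        # No work happens here; we just extend the lineage.
        return RDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return RDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Action: recursively evaluate the lineage from the source down.
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

rdd = RDD(data=range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

The real scheduler additionally splits this lineage into stages at shuffle boundaries and runs tasks per partition, but the lazy-graph-then-action structure is the core of what both Spark and this project share.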

rajasekarv commented 4 years ago

Feel free to reopen this issue when we start developing the Python APIs.