Closed xmehaut closed 4 years ago
I actually meant that point for the Rust DataFrame APIs. By extension, we will try to do the same for the Python version too. Koalas is awesome, but as far as I know it supports only Spark data types, just like the PySpark/Spark DataFrame APIs. We will develop along its lines and, if possible, try to include generic types as pandas does. In Rust this is relatively easy. For Python, it is too early to guarantee anything.
Thanks for the answer. Does that mean that dataframes will be mutable too? If so, could something like Delta Lake's features be included in fastspark? Last question: will there be an overall architecture diagram explaining how fastspark processes data? Regards
Mutating the inner values and keeping them consistent across the whole system is not possible. However, I am hoping to make something like the following possible.
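import math
import pandas as pd

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def euclid_dis(self, b):
        # Euclidean distance between this point and point b
        return math.sqrt((self.x - b.x)**2 + (self.y - b.y)**2)

my_list = [Point(1, 1), Point(1, 1)]
# Column "O" holds the Point objects themselves, not a Spark data type
plane_pd = pd.DataFrame([[p.x, p.y, p] for p in my_list], columns=list('XYO'))
# Replace each Point in "O" with its distance from the origin
plane_pd["O"] = plane_pd["O"].apply(lambda x: x.euclid_dis(Point(0, 0)))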
This is just a small example. I hope you got the idea.
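To make the "no in-place mutation" point concrete, here is a minimal pure-Python sketch of the same idea where a transformation returns a new frame instead of overwriting a column. The `Frame` class and its `column` and `with_column` methods are hypothetical names invented for illustration; they are not part of fastspark, pandas, or Koalas, and the real dataframe design may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float

    def euclid_dis(self, other: "Point") -> float:
        # Euclidean distance between this point and another point
        return math.sqrt((self.x - other.x) ** 2 + (self.y - other.y) ** 2)


class Frame:
    """Hypothetical immutable column store, for illustration only."""

    def __init__(self, columns):
        # Copy every column into a tuple so it cannot be changed afterwards
        self._columns = {name: tuple(values) for name, values in columns.items()}

    def column(self, name):
        return self._columns[name]

    def with_column(self, name, source, fn):
        # Build a new Frame; the existing one is left untouched
        new_columns = dict(self._columns)
        new_columns[name] = tuple(fn(v) for v in self._columns[source])
        return Frame(new_columns)


origin = Point(0, 0)
points = [Point(3, 4), Point(6, 8)]

# Column "O" stores arbitrary Python objects (generic types)
plane = Frame({"X": [p.x for p in points], "Y": [p.y for p in points], "O": points})

# Derive a distance column without modifying `plane`
distances = plane.with_column("D", "O", lambda p: p.euclid_dis(origin))
```

Here `plane` still holds the original `Point` objects after the call, while `distances` carries the extra `D` column. Returning a new frame per transformation is one way to avoid having to keep mutated values consistent across a distributed system.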
The project is still at too early a stage to comment on other features. With enough support from the community, we can slowly add them.
The RDD and DAG scheduler are more or less the same as Spark's, which have already been studied in detail by many. When I open-source the dataframes, I will explain them in detail, as they are slightly different from Spark's.
Feel free to reopen this issue when we start developing the Python APIs.
Hi, I read in your article (https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c) that you want to take inspiration from pandas for implementing dataframes (and the API). You could consider basing your implementation on Koalas (https://github.com/databricks/koalas). Regards