Open nikhil96sher opened 1 year ago
Tasks Tracking
PyIncrementalExecutionGraph
class that provides a Rust-Python interface and the given functions. Initially, the set of supported operations can be limited to reader
.reader
)appender
)hash_join
, merge_join
)accumulator
)
py-wake
Things to figure out
Interface similar to pandas:
df1 = edf.read_csv(file_names)
- Only creates and doesn't execute.To obtain actual result, you need to call another function on the dataframe.
WAY 1: Getting Result 1 at a time - Extrinsic Result of this DF.
WAY 2: Getting Result Progressively. Stopping execution when good enough.
df1.run_online(callback_function)
The good part is - stopping execution stops execution of all nodes.
We are still able to have some benefit of pipelining since multiple steps.
How to provide a mechanism to stop execution while preserving progress?
df2 = df1.filter(lambda x: x["quantity"] > 5.0)
Now, df2 is an execution graph - a custom class which has (1) execution service, (2) the list of nodes, and (3) their execution output channels (which will remain unconsumed, so that the variable can be later re-used in another query):df3 = df2.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))
Alternate way - by default inplace.
[read_csv] => [appender] => [appender] => [RESULT-DF1]
Challenges