ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(pyspark): manage streaming queries #9157

Open chloeh13q opened 3 weeks ago

chloeh13q commented 3 weeks ago

Is your feature request related to a problem?

No

What is the motivation behind your request?

Better support for streaming functionality in Ibis.

Describe the solution you'd like

PySpark provides methods to manage streaming queries: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html

When I submit streaming workloads via Ibis, Ibis manages the underlying compilation and submission of the query. Because streaming queries don't return and instead run continuously in the background, I sometimes need to check a query's status or stop it. Right now I cannot do this directly in Python code, because Ibis manages the query submission.

I think we can expose a wrapper class that lets users interact with the streaming query from Ibis code, for a smoother streaming experience. I'm not sure whether this is within the scope of Ibis, but right now it's hard to manipulate the underlying query because Ibis does not return it; supporting this would require Ibis to return the underlying PySpark StreamingQuery object.

What version of ibis are you running?

main

What backend(s) are you using, if any?

pyspark


gforsyth commented 3 weeks ago

As an initial implementation, returning the StreamingQuery object seems like a reasonable goal, and then we can explore further conveniences on top of that as we and users see fit.