ELC opened this issue 5 days ago
Thanks @ELC for your request! Yup, this is definitely in scope and of interest!
I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.
Fantastic, thank you! I did use PySpark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing to check that things work?
PySpark Pandas
It might be easiest to just start with this, to be honest. Then, once we've got it working, we can remove a layer of indirection.
In terms of contributing, I think if you take a look at `narwhals._pandas_like`, that's where the implementation for the pandas APIs is. There's an `_implementation` field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
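To make the idea concrete, here is a hypothetical sketch of that kind of dispatch. The function name, the string labels, and the module-prefix approach are all assumptions for illustration, not Narwhals' actual internals:

```python
def detect_implementation(native_frame: object) -> str:
    """Guess which pandas-like library a dataframe comes from via its class's module."""
    module = type(native_frame).__module__
    for prefix, name in (
        ("pandas", "pandas"),
        ("modin", "modin"),
        ("cudf", "cudf"),
        ("pyspark.pandas", "pyspark"),  # the hypothetical new entry
    ):
        if module == prefix or module.startswith(prefix + "."):
            return name
    raise TypeError(f"Unsupported dataframe type from module {module!r}")
```

If the real `_implementation` field is set this way, supporting pandas-on-Spark might indeed just be one more entry plus whatever API differences surface in testing.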
I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run GitHub Actions, and they come with all the relevant Java dependencies pre-installed, so running the tests on CI/CD should not be a problem.
As for local development, there are a couple of options:
For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.
I will have a look at the `_pandas_like` and `_implementation` files to see what's needed and will keep you posted on the progress.
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
No response
Please describe the purpose of the new feature or describe the problem to solve.
PySpark is one of the most widely used dataframe-processing frameworks in the big-data space. There are whole companies built around it, and it is commonplace in the Data Engineering realm.
One common pain point is that data scientists usually work with pandas (or, more recently, Polars), and when their code is integrated into big ETL processes, it is usually converted to PySpark for efficiency and scalability.
I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.
Suggest a solution if possible.
PySpark has two distinct APIs: PySpark SQL (the DataFrame API) and PySpark Pandas (pandas-on-Spark).
Given that PySpark Pandas has an API based on pandas, I believe it should be relatively straightforward to reuse the code already written for the pandas backend.
There is a PySpark SQL to PySpark Pandas conversion, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead turns out to be too large, a separate backend for that API could be considered.
If you have tried alternatives, please describe them below.
No response
Additional information that may help us understand your needs.
I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but would appreciate some concrete direction on how to get started, and confirmation that this is of interest to the maintainers.