narwhals-dev / narwhals

Lightweight and extensible compatibility layer between Polars, pandas, cuDF, Modin, and more!
https://narwhals-dev.github.io/narwhals/
MIT License
218 stars 31 forks

[Enh]: Add Support For PySpark #333

Open ELC opened 5 days ago

ELC commented 5 days ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

No response

Please describe the purpose of the new feature or describe the problem to solve.

PySpark is one of the most used dataframe processing frameworks in the big data space. There are whole companies built around it and it is commonplace in the Data Engineering realm.

One common pain point is that data scientists usually work with pandas (or, more recently, Polars), and when their code is integrated into big ETL processes, it is usually converted to PySpark for efficiency and scalability.

I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.

Suggest a solution if possible.

PySpark has two distinct APIs:

- PySpark SQL (the native DataFrame API)
- PySpark Pandas (the pandas API on Spark)

Given that PySpark Pandas has an API based on pandas, I believe it should be relatively straightforward to reuse the code already written for the pandas backend.

There is a PySpark SQL to PySpark Pandas conversion, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead is too large, a separate backend for that API could be considered instead.

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but I would appreciate some concrete direction on how to get started, and confirmation that this is of interest to the maintainers.

MarcoGorelli commented 5 days ago

Thanks @ELC for your request! Yup, this is definitely in scope and of interest!

jahall commented 3 days ago

I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.

MarcoGorelli commented 3 days ago

Fantastic, thank you! I did use PySpark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing and check that things work?

PySpark Pandas

It might be easiest to just start with this to be honest. Then, once we've got it working, we can remove a layer of indirection

In terms of contributing, I think if you take a look at narwhals._pandas_like, that's where the implementation for the pandas APIs is. There's a _implementation field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
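To make the idea concrete: a hypothetical sketch of the dispatch pattern being described, not Narwhals' actual internals. The names narwhals._pandas_like and _implementation come from this thread; the class, method, and value set below are invented for illustration.

```python
# Hypothetical sketch -- not Narwhals' real code. One pandas-like backend
# keeps an `_implementation` tag and only branches where libraries differ.

class PandasLikeDataFrame:
    # "pyspark" here would mean pyspark.pandas, which mimics the pandas API
    _KNOWN = {"pandas", "modin", "cudf", "pyspark"}

    def __init__(self, native_df, implementation: str):
        if implementation not in self._KNOWN:
            raise ValueError(f"unknown implementation: {implementation}")
        self._native = native_df
        self._implementation = implementation

    def sort(self, *keys: str):
        # `sort_values` is shared across pandas-like libraries; an
        # implementation-specific branch is only needed where APIs diverge.
        return type(self)(
            self._native.sort_values(list(keys)), self._implementation
        )
```

If pyspark.pandas really does match the pandas API for the methods the backend uses, "adding pyspark to the list" could be close to this simple; any divergent methods would grow a branch on _implementation.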

ELC commented 2 days ago

I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run GitHub Actions, and they come with all the relevant Java dependencies pre-installed, so running the tests in CI/CD should not be a problem.

As per local development, there are a couple of options:

For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.

I will have a look at _pandas_like and the _implementation field to see what's needed, and will keep you posted on the progress.