narwhals-dev / narwhals

Lightweight and extensible compatibility layer between Polars, pandas, cuDF, Modin, and more!
https://narwhals-dev.github.io/narwhals/
MIT License
218 stars 31 forks

[Enh]: Add Support For PySpark #333

Open ELC opened 5 days ago

ELC commented 5 days ago

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

No response

Please describe the purpose of the new feature or describe the problem to solve.

PySpark is one of the most used dataframe processing frameworks in the big data space. There are whole companies built around it and it is commonplace in the Data Engineering realm.

One common pain point is that data scientists usually work with pandas (or, more recently, Polars), and when their code is integrated into big ETL processes, it is usually converted to PySpark for efficiency and scalability.

I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.

Suggest a solution if possible.

PySpark has two distinct APIs:

- PySpark SQL (the native DataFrame API)
- PySpark Pandas (the pandas API on Spark)

Given that PySpark Pandas has an API based on pandas, I believe it should be relatively straightforward to reuse the code already written for the pandas backend.

There is a PySpark SQL to PySpark Pandas conversion, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead is too large, a separate backend for that API could be considered instead.

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but I would appreciate some concrete direction on how to get started, and confirmation that this is of interest to the maintainers.

MarcoGorelli commented 5 days ago

Thanks @ELC for your request! Yup, this is definitely in scope and of interest!

jahall commented 3 days ago

I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.

MarcoGorelli commented 3 days ago

Fantastic, thank you! I did use PySpark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing and check that things work?

PySpark Pandas

It might be easiest to just start with this to be honest. Then, once we've got it working, we can remove a layer of indirection

In terms of contributing, I think if you take a look at narwhals._pandas_like, that's where the implementation for the pandas APIs is. There's a _implementation field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
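To make the idea concrete: a hypothetical sketch of the dispatch pattern being described, not Narwhals' actual internals. The names narwhals._pandas_like and _implementation come from this thread; the class, method, and value set below are invented for illustration.

```python
# Hypothetical sketch -- not Narwhals' real code. One pandas-like backend
# keeps an `_implementation` tag and only branches where libraries differ.

class PandasLikeDataFrame:
    # "pyspark" here would mean pyspark.pandas, which mimics the pandas API
    _KNOWN = {"pandas", "modin", "cudf", "pyspark"}

    def __init__(self, native_df, implementation: str):
        if implementation not in self._KNOWN:
            raise ValueError(f"unknown implementation: {implementation}")
        self._native = native_df
        self._implementation = implementation

    def sort(self, *keys: str):
        # `sort_values` is shared across pandas-like libraries; an
        # implementation-specific branch is only needed where APIs diverge.
        return type(self)(
            self._native.sort_values(list(keys)), self._implementation
        )
```

If pyspark.pandas really does match the pandas API for the methods the backend uses, "adding pyspark to the list" could be close to this simple; any divergent methods would grow a branch on _implementation.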

ELC commented 2 days ago

I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run GitHub Actions, and they come with all the relevant Java dependencies pre-installed, so running the tests in CI/CD should not be a problem.

As per local development, there are a couple of options:

For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.

I will have a look at _pandas_like and the _implementation field to see what's needed, and will keep you posted on the progress.