jepegit / cellpy

extract and tweak data from electrochemical tests of cells
MIT License

Make it possible to use Apache Spark Dataframe instead of Pandas #302

Open agerwick opened 8 months ago

agerwick commented 8 months ago

This is a request about using CellPy on a cloud platform, and specifically with Unity Catalog for data governance, which is useful for example if you want to use Databricks. Unity Catalog uses Apache Spark dataframes internally, and much of its advanced functionality comes from the use of Spark, which is built for distributed clusters, while pandas is meant for a single machine. The advantages of Unity Catalog are many: you can trace your data down to row level from source to destination (for example from the Bronze layer via Silver to the Gold layer in the medallion architecture of a Lakehouse setup). If there's an error, you can trace it back to exactly where it originated; without it, we can only trace data at the file level. Here's some info about Unity Catalog: https://www.databricks.com/resources/demos/videos/data-governance/unity-catalog-overview

Would it be possible, for example, to abstract all calls to a dataframe and have a config parameter that tells CellPy whether to use Pandas or Spark?
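As a rough, hypothetical sketch of what I mean (the `DATAFRAME_BACKEND` setting and the `load_raw_table` helper don't exist in CellPy; this is just to illustrate the idea of a config-selected backend):

```python
# Hypothetical sketch: pick the dataframe backend from a config value.
DATAFRAME_BACKEND = "pandas"  # or "spark"

def load_raw_table(path):
    """Load a raw data table with the configured dataframe backend."""
    if DATAFRAME_BACKEND == "pandas":
        import pandas as pd
        return pd.read_csv(path)
    elif DATAFRAME_BACKEND == "spark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        return spark.read.csv(path, header=True, inferSchema=True)
    raise ValueError(f"Unknown dataframe backend: {DATAFRAME_BACKEND}")
```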

There are a number of syntactical differences between the two types of dataframes, so it's not possible to simply convert a pandas df to spark. Here are some of the differences: https://gist.github.com/agerwick/fe187b2acbd2144f87002995128cd53b (note that I haven't tested all of these)
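For illustration, here is the kind of difference I mean (generic pandas vs PySpark syntax with made-up column names, not taken from CellPy and not exhaustive):

```python
# pandas: boolean-mask filtering and a new derived column
subset_pd = df_pd[df_pd["voltage"] > 3.0].copy()
subset_pd["power"] = subset_pd["voltage"] * subset_pd["current"]

# PySpark: the same operations go through col() expressions and withColumn()
from pyspark.sql import functions as F
subset_sp = df_sp.filter(F.col("voltage") > 3.0)
subset_sp = subset_sp.withColumn("power", F.col("voltage") * F.col("current"))
```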

This is a long term request, not something that is urgently needed. But it would be interesting to know approximately how much work this would be. If it is not something that requires a major rewrite, we can consider to do it sooner.

jepegit commented 8 months ago

Yes, agreed. Data governance is important. Using the config to set the "dataframe backend" sounds like a good idea. And maybe it also opens up future possibilities of letting users use polars instead if they want? @agerwick, any suggestion on what kind of label is most appropriate for this here in GitHub? Maybe "enhancement"? Or maybe we make a new one called "strategic goal" or something?

agerwick commented 8 months ago

I guess "enhancement" is an appropriate label, although it's a pretty major one, so I see why you would consider differentiating it. And yes, this could open up for using polars as well, which would be neat! I'm surprised I haven't been able to find a "dataframe abstraction layer" with multiple backends yet... Maybe we should make one? Or maybe it's simply not feasible, as Spark, for example, lacks certain concepts that Pandas has, such as an index column. We'd have to provide abstractions for the different things an index column is used for and do them another way if the backend doesn't have one.
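For illustration, one common way to paper over the missing index on the Spark side (just an untested sketch, not a concrete proposal for CellPy):

```python
# pandas keeps row identity in the index "for free":
df_pd = df_pd.reset_index(drop=True)  # 0..n-1 positional index

# Spark has no index; an explicit id column is one common substitute.
# monotonically_increasing_id() gives unique, increasing (but not
# consecutive) ids, which is enough for joins and row-level tracing,
# but not for positional slicing like .iloc.
from pyspark.sql import functions as F
df_sp = df_sp.withColumn("row_id", F.monotonically_increasing_id())
```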