Should vizro support polars (or other dataframes besides pandas)?

antonymilne commented 10 months ago

> Ty Petar, please consider supporting polars. I think it is necessary, given that the whole point of vizro is working with a dataframe in memory. Currently vizro cannot determine polars column names (it detects them as [0,1,2,3,4...]).
>
> Originally posted by @vmisusu in https://github.com/mckinsey/vizro/issues/191#issuecomment-1845368168

I'm opening this issue to see whether other people have the same question so we can figure out what priority it should be. Just hit 👍 if it's something you'd like to see in vizro, and feel free to leave any comments.

The current situation (25 January 2024) is:

- vizro currently only supports pandas DataFrames, but supporting others like polars is a great idea and something we did consider before. The main blocker previously was that plotly didn't support polars, but as of 5.15 it supports not just polars but any dataframe with a to_pandas method, and as of 5.16 it supports dataframes that follow the dataframe interchange protocol (which is now pip installable).
- On vizro we could follow a similar pattern to plotly's development[^1]. Supporting the dataframe interchange protocol is ideally the "right" way to do this, but we should work out how much performance improvement polars users would actually get in practice, to see what value this would have over a simple to_pandas call. The biggest changes we'd need to make would be to actions code like the filtering functionality (FYI @petar-qb). I don't think it would be too hard, but it's certainly not a small task either.
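To illustrate why the filtering code is the part that would need the most work, here is a minimal sketch of a pandas-style filter. The helper name `filter_isin` and the sample data are hypothetical and are not taken from vizro's actual actions code; the point is only that boolean-mask indexing is pandas-specific, so a polars frame would need a different spelling.

```python
import pandas as pd


def filter_isin(data_frame: pd.DataFrame, column: str, values: list) -> pd.DataFrame:
    """Hypothetical filter helper (an assumption, not vizro's real implementation)."""
    # Boolean-mask indexing like this only works on pandas objects.
    # With polars the equivalent would be:
    #     data_frame.filter(pl.col(column).is_in(values))
    return data_frame[data_frame[column].isin(values)]


df = pd.DataFrame(
    {"species": ["setosa", "virginica", "setosa"], "width": [3.5, 3.0, 3.2]}
)
filtered = filter_isin(df, "species", ["setosa"])
```

Any code written in this style would either have to be rewritten per dataframe library or be run only after a conversion to pandas.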

From @Coding-with-Adam:

> Chad had a nice app that he built to compare pandas and polars and show the difference when using Dash: https://dash-polars-pandas-docker.onrender.com/ (free tier). I also made a video with him: https://youtu.be/_iebrqafOuM. And here’s the article he wrote: Dash: Polars vs Pandas. An interactive battle between the… | by Chad Bell | Medium

FYI @astrojuanlu

[^1]: https://github.com/plotly/plotly.py/pull/4244 https://github.com/plotly/plotly.py/pull/4272/files https://github.com/plotly/plotly.py/pull/3901 https://github.com/plotly/plotly.py/issues/3637

datajoely commented 9 months ago

Maybe Ibis is a good fit here?

astrojuanlu commented 9 months ago

I only reacted with 🚀 to this, but to make my position more clear,

> as of 5.15 it supports not just polars but actually any dataframe with a to_pandas method, and as of 5.16 it supports dataframes that follow the dataframe interchange protocol (https://github.com/data-apis/dataframe-api/issues/73)

this is awesome ⭐

> but we should work out exactly how much performance improvement polars users would actually get in practice to see what the value of this would be over a simple to_pandas call.

I think it's more about developer experience (DX) than performance improvement. If folks are using Polars for whatever reason and then have to call .to_pandas() to use Vizro, it feels a bit meh. If Vizro supports Polars natively, it's more pleasant.

astrojuanlu commented 9 months ago

Just seen on the Polars LinkedIn:

> Check migrated all 100+ of their Airflow DAGs from pandas to Polars and saved 25% in cloud expenses.

https://pola.rs/posts/case-check-technology/

reouvenzana commented 5 months ago

Any update about Polars integration? More and more people are using it.

datajoely commented 5 months ago

Supporting Narwhals may be a sensible first step, since we'd kill many birds with one stone: https://github.com/narwhals-dev/narwhals

Would also like to support Ibis for the same reason.

antonymilne commented 5 months ago

Narwhals looks very interesting, thanks for pointing it out @datajoely.

@reouvenzana no updates on this - the current situation outlined in the first post still applies here. It's something I'd still like to do but it just hasn't been prioritised yet. Your comment here though does help to bump the priority up!

When we do implement this, whatever we do is likely to closely follow plotly's pattern to begin with, so there won't be any performance improvements, just a DX improvement as @astrojuanlu suggested above.

@reouvenzana are you interested in using polars in vizro for performance improvements or just for ease of use to avoid doing a to_pandas call?

reouvenzana commented 5 months ago

@antonymilne

> Your comment here though does help to bump the priority up!

Nice!

> @reouvenzana are you interested in using polars in vizro for performance improvements or just for ease of use to avoid doing a to_pandas call?

Honestly, it's more about the API and functionality of Polars (easy method chaining, list columns which are really useful) than performance issues. It's bothersome to have pl.DataFrame and df.to_pandas() everywhere in my code, and also to handle the mismatch between data types. I'm aware that Dash / Plotly "supports" polars, though there is no performance gain, as you've pointed out. Still, it would be great, given the increasing traction behind Polars.

antonymilne commented 5 months ago

Got it, thank you @reouvenzana, that makes a lot of sense 👍

For future reference, the logic they use to do the conversion to pandas is here: https://github.com/plotly/plotly.py/blob/51eb5ea9fefda27bccfdb21e660b8d4035cef3b0/packages/python/plotly/plotly/express/_core.py#L1323-L1353. So with any pandas >= 2.0.2, plotly will use __dataframe__ rather than to_pandas. We would probably do something similar to this to begin with anyway.
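A rough sketch of what "something similar" could look like is below. It follows the same order of checks described above (interchange protocol first when pandas is at least 2.0.2, then to_pandas, then the plain DataFrame constructor), but the function name `to_pandas_frame`, the version-parsing helper, and the `FakeFrame` stand-in are assumptions made for illustration; this is not plotly's or vizro's actual code.

```python
import re

import pandas as pd


def _pandas_version() -> tuple:
    # Keep only the leading numeric parts, e.g. "2.2.0rc0" -> (2, 2, 0).
    return tuple(int(part) for part in re.findall(r"\d+", pd.__version__)[:3])


def to_pandas_frame(data):
    """Hypothetical helper: coerce a dataframe-like object to pandas."""
    if isinstance(data, pd.DataFrame):
        return data
    # Prefer the dataframe interchange protocol when pandas is new enough.
    if hasattr(data, "__dataframe__") and _pandas_version() >= (2, 0, 2):
        from pandas.api.interchange import from_dataframe

        return from_dataframe(data)
    # Otherwise fall back to a to_pandas method (polars, pyarrow, ...).
    if hasattr(data, "to_pandas"):
        return data.to_pandas()
    # Last resort: let pandas interpret dicts, lists, numpy arrays, etc.
    return pd.DataFrame(data)


class FakeFrame:
    """Stand-in for a third-party dataframe that only offers to_pandas."""

    def to_pandas(self) -> pd.DataFrame:
        return pd.DataFrame({"a": [1, 2]})


from_fake = to_pandas_frame(FakeFrame())
from_dict = to_pandas_frame({"x": [1, 2, 3]})
```

Converting once at the boundary like this gives the DX improvement discussed in this thread, but everything downstream still runs on pandas, so it would not by itself bring any performance benefit.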

Should vizro support polars (or other dataframes besides pandas)? #286