Open MarcoGorelli opened 1 week ago
Not a maintainer here, but #3426, #3384 and this all landing in the next release sounds great to me - if all goes well.
Found some additional pyarrow
references in data.py
, but most lead here:
https://github.com/vega/altair/blob/76a9ce1e27a21359e7b0ef9e6cb865f8d95fc298/altair/utils/data.py#L406-L408
Also, if this ends up being a PR and only changes a few files - it may be worth checking https://github.com/vega/altair/pull/3431/files since there are a lot of minor changes that may have simplified/standardized what is on main
.
Thanks for opening this issue! I'm surely in favor of improving native polars support without relying on a pyarrow dependency. As you have noticed, we still need to distill existing pandas serialization logic to become more dataframe-agnostic.
If I understand right, the suggested change here is to introduce a direct link to polars for a polars DataFrame if that object is compatible with the dataframe interchange protocol?
Eg, some changes here: https://github.com/vega/altair/blob/76a9ce1e27a21359e7b0ef9e6cb865f8d95fc298/altair/utils/data.py#L406-L408
The introduction of the dataframe interchange protocol is not only for polars, but others are using it too, eg ibis ( https://github.com/vega/altair/issues/3110), or custom objects as in https://github.com/vega/vegafusion/issues/386.
I'm not really in favor of introducing an optional polars dependency, but in line with https://github.com/vega/altair/pull/3377, we are already working towards a direction where we can use direct arrow conversion methods without using the pyarrow from_dataframe interface. If we can improve this direction even more so that pyarrow is not a hard dependency anymore within the part that serialize objects that are compatible with the dataframe interchange protocol that would be great.
Simultaneously, I also noticed that vegafusion has recently introduced an alternative approach for serializing objects that are compatible with polars, see eg this part:
Anyways! PRs in this direction are welcome!
Thanks for your response! 🙏
I think there are some methods which aren't covered by the interchange protocol unfortunately, such as converting a Date / Datetime column to string. So, with regards to
I'm not really in favor of introducing a optional polars dependency
would you be OK with having a couple of specialised Polars branches, where you can do something like
if (pl := sys.modules.get('polars') is not None) and isinstance(df, pl.DataFrame):
import polars.selectors as cs
df = df.with_columns(cs.date().dt.to_string('%F'))
return df
? I think the diff should be quite minimal
I think there's only 2 places where that would be necessary. Other parts where you use the dataframe interchange protocol are fine, it's just the part where you convert to a pyarrow table which could be skipped for the Polars case
Anyways! PRs in this direction are welcome!
Thanks! I'll put something together to drive this forwards, just wanted to check that you'd be OK with a couple (just a couple!) of specialised Polars paths
First, we try to solve things pragmatic if the dataframe interchange protocol is incomplete. There is not always a royal road. But for my own understanding, isn't this what we are doing already in https://github.com/vega/altair/pull/3377, where we use a direct to_arrow
method on a dataframe that supports the dataframe interchange protocol?
Yeah, and to_arrow
requires PyArrow as a dependency :) In the Polars case, Altair is converting Polars to PyArrow, to do things which are already possible (and fairly easy) to do in Polars directly. And that feels like a missed opportunity to keep Altair really lightweight for Polars users who otherwise wouldn't need to bloat their environment up with PyArrow (which is a huge dependency, and not generally required for working with Polars)
Ah.. therefor raising this issue:)
So the sanitize_dataframe
function:
https://github.com/vega/altair/blob/62ab14dcedb52f0a32c8de613c4e6883bdba8796/altair/utils/core.py#L309-L310
should be renamed to sanitize_pandas_dataframe
And next to sanitize_arrow_table
https://github.com/vega/altair/blob/62ab14dcedb52f0a32c8de613c4e6883bdba8796/altair/utils/core.py#L434-L437
Will come a new sanitize_polars_dataframe
function, with a couple of specialised Polars paths?
I agree that pyarrow is a rather large dependency and we hope to stay lightweight, especially for wasm environments.
Yup, thanks for appreciating the value in this! Alright, PR incoming this week
Hi @MarcoGorelli, just catching up on this thread and on your and @dangotbanned's comments in https://github.com/vega/altair/pull/3384.
I'm all in favor of streamlining polars support in Altair, but I'm less enthused about having polars specific logic scattered around the code base. On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals? Then ideally over time we could move the other DataFrame types that narwhals supports behind this code path as well.
cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.
Thanks for your response!
On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals?
Well, I'm not going to make you ask me twice 😄 Thanks, I'll give this a go and open a PR for your consideration! I don't think it'll be too big of an effort, if you then decide you don't want it, I promise no offense will be taken 😇
Just browsed through the narwhals docs, sounds very interesting! I'm also onboard to allow Polars users to use Altair without the need to have PyArrow or Pandas alongside it.
@MarcoGorelli Thanks for offering to open a PR! Just for you to know, https://github.com/vega/altair/pull/3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.
@MarcoGorelli Altair supports Pandas >= 0.25. I think it would be ok to bump this to 1.1.5 in your PR to be in line with Narwhal.
Btw, really enjoyed your talk at PyData Berlin in April :)
@MarcoGorelli Thanks for offering to open a PR! Just for you to know, #3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.
Just fyi, #3431 is merged.
cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.
We like it when Altair is lightweight and dataframe agnostic.
We have experimented with adopting the dataframe interchange protocol through relying on pyarrow.interchange.from_dataframe
to reach that goal.
Eventhough the interface protocol brought us a lot in relation to becoming dataframe agnostic, but not all (ref #3377) and unfortunately pyarrow is not a library that makes Altair lightweight.
If there are better options to become dataframe agnostic and lightweight by not having hard dependencies on heavy dataframe packages, than it seems like something to consider seriously.
If we found out that column type inference and data serialization for objects that are like dataframes can be delegated safely to narwhals, than it at least is something we should consider as a potential candidate.
I say, 'like dataframes', because I don't think narwhals will supports arrays is it? (something we eventually hope to cover one day). Surely we aim not to introduce regressions during this consideration.
Could we jump straight to supporting polars without pyarrow using narwhals?
Here we go 🚀 https://github.com/vega/altair/pull/3452
I totally agree that it's an imperative that this would come with zero regressions. I'm not aware of any:
I don't think narwhals will supports arrays is it?
That's right, arrays are a different beast - I'd suggest taking a look at https://data-apis.org/array-api/latest/ for that. Array libraries were sufficiently aligned to begin with that a standard like that was possible and successful, whereas for dataframes the related effort was discontinued (and that's how Narwhals was born - I'm not calling it a "standard", but I am trying to use my position as pandas + Polars dev to enable libraries to support the latter at no cost to the former)
I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)
I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)
- Regarding maintainability, there are other major libraries considering using Narwhals (scikit-learn, shiny), so I think there's enough interested parties such that, if we set up a good governance model, it could thrive even if I disappeared. I've already given write and admin access to some people I trust and who have made really solid contributions
- Regarding breaking changes: we're aiming for a "perfect backwards compatibility" policy in Narwhals, inspired by Rust Editions, see this RFC
Thanks for addressing this proactively. If other major libraries use it and if it has such a compatibility policy, that does definitely provide some comfort. In addition, Altair does not have that much data transformation logic built in. So even if for whatever reason we would need to remove the dependency on Narwhals again, I think that would be doable in a weekend :) To me the project looks well structured, documented, and the best shot we have. The alternative would be to implement different code paths ourselves.
What is your suggestion?
Currently, PyArrow is required by Altair for Polars support. I think it shouldn't be too hard to remove it, given that Polars implements the dataframe interchange protocol natively (without depending on PyArrow)
If #3384 can make it in, then Altair would actually support plotting Polars dataframe natively without any extra heavy dependencies. That'd be...pretty amazing? I'd suggest using Altair for
polars.DataFrame.plot
if that was the caseI think what would need doing is:
pyarrow
to be installed for thedfi = data.__dataframe__
partsanitize_arrow_table
, for Polars, just select date/datetime columns and call.dt.to_string()
to_pylist
from PyArrow, just useDataFrame.rows(named=True)
for Polarsinfer_vegalite_type_for_dfi_column
. I haven't tried this yet, but it looks straightforward-ishWould you open to considering this? Happy to work on a PR if so
Have you considered any alternative solutions?
Just keep the status-quo :) But, I think Altair is the only plotting library that gets close to native Polars support without extra large dependencies, and it doesn't look like a large stretch to go all the way there, so I'm hoping we can do it 💪
Demo from having tried this locally: