vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License
9.06k stars 772 forks source link

Remove PyArrow dependency for Polars support #3445

Open MarcoGorelli opened 1 week ago

MarcoGorelli commented 1 week ago

What is your suggestion?

Currently, PyArrow is required by Altair for Polars support. I think it shouldn't be too hard to remove it, given that Polars implements the dataframe interchange protocol natively (without depending on PyArrow)

If #3384 can make it in, then Altair would actually support plotting Polars dataframe natively without any extra heavy dependencies. That'd be...pretty amazing? I'd suggest using Altair for polars.DataFrame.plot if that was the case

I think what would need doing is:

Would you open to considering this? Happy to work on a PR if so

Have you considered any alternative solutions?

Just keep the status-quo :) But, I think Altair is the only plotting library that gets close to native Polars support without extra large dependencies, and it doesn't look like a large stretch to go all the way there, so I'm hoping we can do it 💪


Demo from having tried this locally:

image

dangotbanned commented 1 week ago

Not a maintainer here, but #3426, #3384 and this all landing in the next release sounds great to me - if all goes well.

Found some additional pyarrow references in data.py, but most lead here: https://github.com/vega/altair/blob/76a9ce1e27a21359e7b0ef9e6cb865f8d95fc298/altair/utils/data.py#L406-L408

Also, if this ends up being a PR and only changes a few files - it may be worth checking https://github.com/vega/altair/pull/3431/files since there are a lot of minor changes that may have simplified/standardized what is on main.

mattijn commented 1 week ago

Thanks for opening this issue! I'm surely in favor of improving native polars support without relying on a pyarrow dependency. As you have noticed, we still need to distill existing pandas serialization logic to become more dataframe-agnostic.

If I understand right, the suggested change here is to introduce a direct link to polars for a polars DataFrame if that object is compatible with the dataframe interchange protocol?

Eg, some changes here: https://github.com/vega/altair/blob/76a9ce1e27a21359e7b0ef9e6cb865f8d95fc298/altair/utils/data.py#L406-L408

And here: https://github.com/vega/altair/blob/76a9ce1e27a21359e7b0ef9e6cb865f8d95fc298/altair/utils/data.py#L406-L416

The introduction of the dataframe interchange protocol is not only for polars, but others are using it too, eg ibis ( https://github.com/vega/altair/issues/3110), or custom objects as in https://github.com/vega/vegafusion/issues/386.

I'm not really in favor of introducing an optional polars dependency, but in line with https://github.com/vega/altair/pull/3377, we are already working towards a direction where we can use direct arrow conversion methods without using the pyarrow from_dataframe interface. If we can improve this direction even more so that pyarrow is not a hard dependency anymore within the part that serialize objects that are compatible with the dataframe interchange protocol that would be great.

Simultaneously, I also noticed that vegafusion has recently introduced an alternative approach for serializing objects that are compatible with polars, see eg this part:

https://github.com/vega/vegafusion/blob/007bd44188676de7259bc02a61693b3dc7586072/python/vegafusion/vegafusion/runtime.py#L218-L225

Anyways! PRs in this direction are welcome!

MarcoGorelli commented 1 week ago

Thanks for your response! 🙏

I think there are some methods which aren't covered by the interchange protocol unfortunately, such as converting a Date / Datetime column to string. So, with regards to

I'm not really in favor of introducing a optional polars dependency

would you be OK with having a couple of specialised Polars branches, where you can do something like

if (pl := sys.modules.get('polars') is not None) and isinstance(df, pl.DataFrame):
    import polars.selectors as cs
    df = df.with_columns(cs.date().dt.to_string('%F'))
    return df

? I think the diff should be quite minimal

I think there's only 2 places where that would be necessary. Other parts where you use the dataframe interchange protocol are fine, it's just the part where you convert to a pyarrow table which could be skipped for the Polars case

Anyways! PRs in this direction are welcome!

Thanks! I'll put something together to drive this forwards, just wanted to check that you'd be OK with a couple (just a couple!) of specialised Polars paths

mattijn commented 1 week ago

First, we try to solve things pragmatic if the dataframe interchange protocol is incomplete. There is not always a royal road. But for my own understanding, isn't this what we are doing already in https://github.com/vega/altair/pull/3377, where we use a direct to_arrow method on a dataframe that supports the dataframe interchange protocol?

MarcoGorelli commented 1 week ago

Yeah, and to_arrow requires PyArrow as a dependency :) In the Polars case, Altair is converting Polars to PyArrow, to do things which are already possible (and fairly easy) to do in Polars directly. And that feels like a missed opportunity to keep Altair really lightweight for Polars users who otherwise wouldn't need to bloat their environment up with PyArrow (which is a huge dependency, and not generally required for working with Polars)

mattijn commented 1 week ago

Ah.. therefor raising this issue:)

So the sanitize_dataframe function: https://github.com/vega/altair/blob/62ab14dcedb52f0a32c8de613c4e6883bdba8796/altair/utils/core.py#L309-L310 should be renamed to sanitize_pandas_dataframe

And next to sanitize_arrow_table https://github.com/vega/altair/blob/62ab14dcedb52f0a32c8de613c4e6883bdba8796/altair/utils/core.py#L434-L437

Will come a new sanitize_polars_dataframe function, with a couple of specialised Polars paths?

I agree that pyarrow is a rather large dependency and we hope to stay lightweight, especially for wasm environments.

MarcoGorelli commented 1 week ago

Yup, thanks for appreciating the value in this! Alright, PR incoming this week

jonmmease commented 1 week ago

Hi @MarcoGorelli, just catching up on this thread and on your and @dangotbanned's comments in https://github.com/vega/altair/pull/3384.

I'm all in favor of streamlining polars support in Altair, but I'm less enthused about having polars specific logic scattered around the code base. On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals? Then ideally over time we could move the other DataFrame types that narwhals supports behind this code path as well.

cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.

MarcoGorelli commented 1 week ago

Thanks for your response!

On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals?

Well, I'm not going to make you ask me twice 😄 Thanks, I'll give this a go and open a PR for your consideration! I don't think it'll be too big of an effort, if you then decide you don't want it, I promise no offense will be taken 😇

binste commented 1 week ago

Just browsed through the narwhals docs, sounds very interesting! I'm also onboard to allow Polars users to use Altair without the need to have PyArrow or Pandas alongside it.

@MarcoGorelli Thanks for offering to open a PR! Just for you to know, https://github.com/vega/altair/pull/3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.

binste commented 1 week ago

@MarcoGorelli Altair supports Pandas >= 0.25. I think it would be ok to bump this to 1.1.5 in your PR to be in line with Narwhal.

Btw, really enjoyed your talk at PyData Berlin in April :)

binste commented 1 week ago

@MarcoGorelli Thanks for offering to open a PR! Just for you to know, #3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.

Just fyi, #3431 is merged.

mattijn commented 1 week ago

cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.

We like it when Altair is lightweight and dataframe agnostic.

We have experimented with adopting the dataframe interchange protocol through relying on pyarrow.interchange.from_dataframe to reach that goal.

Eventhough the interface protocol brought us a lot in relation to becoming dataframe agnostic, but not all (ref #3377) and unfortunately pyarrow is not a library that makes Altair lightweight.

If there are better options to become dataframe agnostic and lightweight by not having hard dependencies on heavy dataframe packages, than it seems like something to consider seriously.

If we found out that column type inference and data serialization for objects that are like dataframes can be delegated safely to narwhals, than it at least is something we should consider as a potential candidate.

I say, 'like dataframes', because I don't think narwhals will supports arrays is it? (something we eventually hope to cover one day). Surely we aim not to introduce regressions during this consideration.

MarcoGorelli commented 3 days ago

Could we jump straight to supporting polars without pyarrow using narwhals?

Here we go 🚀 https://github.com/vega/altair/pull/3452

I totally agree that it's an imperative that this would come with zero regressions. I'm not aware of any:

I don't think narwhals will supports arrays is it?

That's right, arrays are a different beast - I'd suggest taking a look at https://data-apis.org/array-api/latest/ for that. Array libraries were sufficiently aligned to begin with that a standard like that was possible and successful, whereas for dataframes the related effort was discontinued (and that's how Narwhals was born - I'm not calling it a "standard", but I am trying to use my position as pandas + Polars dev to enable libraries to support the latter at no cost to the former)


I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)

binste commented 3 days ago

I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)

  • Regarding maintainability, there are other major libraries considering using Narwhals (scikit-learn, shiny), so I think there's enough interested parties such that, if we set up a good governance model, it could thrive even if I disappeared. I've already given write and admin access to some people I trust and who have made really solid contributions
  • Regarding breaking changes: we're aiming for a "perfect backwards compatibility" policy in Narwhals, inspired by Rust Editions, see this RFC

Thanks for addressing this proactively. If other major libraries use it and if it has such a compatibility policy, that does definitely provide some comfort. In addition, Altair does not have that much data transformation logic built in. So even if for whatever reason we would need to remove the dependency on Narwhals again, I think that would be doable in a weekend :) To me the project looks well structured, documented, and the best shot we have. The alternative would be to implement different code paths ourselves.