narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

Define three levels of support #388

Closed MarcoGorelli closed 1 week ago

MarcoGorelli commented 3 weeks ago

Currently all the backends we support implement the full Narwhals API, which includes expressions

But if someone just needs to inspect schemas and some other simple things, it should be possible to do so in a much simpler way

So, we could have ~two~ three levels of support:

The objective of "Metadata" would be to be something like an extended and more user-friendly DataFrame Interchange Protocol. Any library which already supports the interchange protocol would be in-scope
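
A rough illustration of the kind of "simple things" meant here (a hedged sketch only, using the narwhals.from_native entry point mentioned later in the thread; the pandas frame is just an example input):

import pandas as pd
import narwhals as nw

native_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Metadata-level usage: wrap the native object and only inspect schema
# information, without needing the full expression API.
df = nw.from_native(native_df)
print(df.columns)
print(df.schema)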

lostmygithubaccount commented 2 weeks ago

@MarcoGorelli the Ibis issue noted as a blocker has been closed since January, is there still any blocker?

also, is the Narwhals API mimicking the polars.DataFrame API or the polars.LazyFrame API?

MarcoGorelli commented 2 weeks ago

has been closed since January, is there still any blocker?

it's closed in the sense that Ibis forbids that operation - so, I'd say yes 😄

also, is the Narwhals API mimicking the polars.DataFrame API or the polars.LazyFrame API?

both, we have both narwhals.DataFrame and narwhals.LazyFrame

I think you can go very easily to PyArrow for Ibis's table though, right? We're ramping up pyarrow.Table support, so it should be possible for anyone passing in an Ibis table to go straight to pyarrow.Table, and then we can support the remaining operations

lostmygithubaccount commented 2 weeks ago

this is the case where you want to sort a single column while maintaining the order of the other columns? Ibis/the relational model don't have an ordered index, right, so you'd have to model this by splitting into two tables and joining back (i.e. I'm pretty sure mimicking the behavior is possible; people have built the pandas API on Ibis and presumably support this)

wrote an example over on the original issue

are there any other blockers?

going through Arrow tables for larger datasets is going to be slow

MarcoGorelli commented 2 weeks ago

thanks for the example!

In general it's not clear to me how to make Ibis fully compatible - but I'm working on making Narwhals more pluggable, so then the structure should be:

and narwhals.dataframe.py should only check that an object has a __narwhals_dataframe__ attribute. If we can do that, then it would be a lot easier for you to experiment with Ibis support (if it interests you!): just fork and try adding narwhals/_ibis, or adding __narwhals_dataframe__ directly to ibis.Table, and see where things go
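
A minimal sketch of what that hook could look like (the IbisInterchangeFrame wrapper and its methods are hypothetical; only the __narwhals_dataframe__ attribute check comes from the description above):

# Hypothetical sketch: a thin wrapper exposing a native table (e.g. an Ibis
# Table) to Narwhals via the __narwhals_dataframe__ hook described above.
class IbisInterchangeFrame:
    def __init__(self, native_table) -> None:
        self._native_table = native_table

    def __narwhals_dataframe__(self):
        # Narwhals would call this to obtain the backend-specific implementation.
        return self

    @property
    def columns(self):
        return list(self._native_table.columns)


# narwhals.dataframe.py would then only need a duck-typed check like:
def supports_narwhals(obj) -> bool:
    return hasattr(obj, "__narwhals_dataframe__")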

MarcoGorelli commented 1 week ago

the 'metadata' level is coming along in https://github.com/narwhals-dev/narwhals/pull/517

If we're adding levels, I think a "sql frontend" level could also be in scope, which Ibis and DuckDB could fit into.

This is a reversal of my original stance, and I need to acknowledge this - so, sorry for having originally only wanted to stick to full-API-support libraries

lostmygithubaccount commented 1 week ago

If we're adding levels, I think a "sql frontend" level could also be in scope, which Ibis and DuckDB could fit into.

I'm still confused w.r.t. Ibis: are there any APIs in the subset of the Polars API (pl.DataFrame and pl.LazyFrame) that Narwhals implements which cannot be implemented with the Ibis API?

MarcoGorelli commented 1 week ago

For example:

In [20]: if tbl['b'].mean() > 0:
    ...:     # something

In Polars, that's also unsupported for pl.LazyFrame, but pl.DataFrame allows it. And narwhals.from_native has an eager_only keyword argument. As far as I can tell, Ibis doesn't distinguish between expressions and Series?
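
To make the eager/lazy distinction concrete, here's a minimal Polars illustration (not from the thread):

import polars as pl

df = pl.DataFrame({"b": [1, 2, 3]})

# Eager: df["b"].mean() returns a plain Python float, so truthiness works.
if df["b"].mean() > 0:
    print("positive mean")

lf = df.lazy()
# Lazy: there is no lf["b"]; you build an expression and collect explicitly.
# lf.select(pl.col("b").mean()) is itself a LazyFrame with no truth value.
if lf.select(pl.col("b").mean()).collect().item() > 0:
    print("positive mean")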

For a concrete use case, check how scikit-lego is using Narwhals, especially TimeGapSplit. The source code is here: https://github.com/koaning/scikit-lego/blob/1d9ef4c90e0ac61c0140e8f61be1b897e8afffeb/sklego/model_selection.py#L17-L272

Is it possible to rewrite that using Ibis?

lostmygithubaccount commented 1 week ago

what do you mean by this example?

In [20]: if tbl['b'].mean() > 0:
    ...:     # something

you can achieve this with the Ibis API

Ibis doesn't distinguish between expressions and Series

I'm not sure what you mean by this or why exactly it would matter; concepts and behaviors will differ across dataframe libraries, right? my understanding is the Narwhals API is meant to abstract these differences with a subset of the Polars API

I'm fairly confident you could write that scikit-lego code in Ibis (you can, for instance, implement the pandas API on top of Ibis or SQL in general)

MarcoGorelli commented 1 week ago

In the linked code, there's something like this:

date_max = X_index_df["__date__"].max()
# [...]
if current_date + self.train_duration + time_shift + self.gap_duration > date_max:
    break

If I try evaluating this kind of operation with an ibis table, it raises:

In [28]: type(tbl)
Out[28]: ibis.expr.types.relations.Table

In [29]: if 3 > tbl['b'].max():
    ...:     print('here')
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[29], line 1
----> 1 if 3 > tbl['b'].max():
      2     print('here')

File ~/scratch/.venv/lib/python3.11/site-packages/ibis/expr/types/core.py:170, in Expr.__bool__(self)
    169 def __bool__(self) -> bool:
--> 170     raise ValueError("The truth value of an Ibis expression is not defined")

ValueError: The truth value of an Ibis expression is not defined

you can, for instance, implement the pandas API on top of Ibis

Thanks - has this been done? If so, then we can (trivially) reuse narwhals/_pandas_like for that

lostmygithubaccount commented 1 week ago

as I mentioned, the concepts and behaviors will differ -- with Ibis, you'd achieve that like:

[ins] In [1]: import ibis

[ins] In [2]: t = ibis.examples.penguins.fetch()

[nav] In [3]: if t["bill_length_mm"].mean().to_pyarrow().as_py() > 0:
         ...:     print("hello")
hello

though there are a few other options to achieve the same behavior. this is explicit (go through PyArrow to a Python object) and we could put some syntactic sugar around this if that's something people want, but it seems sufficient for abstracting behind the Narwhals API
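
For reference, another common way to get a Python scalar out of an Ibis expression (assuming a standard Ibis backend) is to call .execute() on the scalar expression rather than going through PyArrow:

import ibis

t = ibis.examples.penguins.fetch()

# .execute() materialises the scalar expression and returns a plain value,
# which can then be used in ordinary Python control flow.
if t["bill_length_mm"].mean().execute() > 0:
    print("hello")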

Thanks - has this been done? If so, then we can (trivially) reuse narwhals/_pandas_like for that

it's tightly coupled to a single backend and has a ton of performance overhead. like Polars, Ibis meaningfully differs from the pandas API and of course does not maintain an ordered index. you can still achieve the same behavior as an ordered index and mimic the pandas API (I'm just using this to explain why I'm fairly confident you can implement the breadth of the Narwhals API using the Ibis API)

MarcoGorelli commented 1 week ago

go through PyArrow to a Python object

sure but then you're automatically triggering computation for the user? If someone does

if t['a'] > 0:
    t = func_1(t)
if t['b'] < t['c']:
    t = func_2(t)

and these comparisons are each automatically triggering .to_pyarrow().as_py(), then doesn't that become potentially extremely expensive?
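
For illustration only, the kind of "syntactic sugar" being discussed would look roughly like this (a hypothetical wrapper, not an existing Ibis or Narwhals API):

# Hypothetical sketch of implicit computation: a wrapper whose truth value
# silently executes the underlying query each time it is evaluated.
class ImplicitScalar:
    def __init__(self, ibis_scalar_expr) -> None:
        self._expr = ibis_scalar_expr

    def __bool__(self) -> bool:
        # Every `if` on this object re-runs the backend query.
        return bool(self._expr.to_pyarrow().as_py())


# Each of these conditions would trigger a separate round-trip to the backend:
# if ImplicitScalar(t["a"].mean() > 0): ...
# if ImplicitScalar(t["b"].mean() < t["c"].mean()): ...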

Side note: automatic implicit computation was one of the major disagreements I had with some of your colleagues during the dataframe consortium project, which spurred me to take several ideas they'd rejected (expressions, separating lazy vs eager apis, to_dict, iterating over groups, an explicit collect, to_numpy, and much more) and spin them out into a separate project which I ended up calling "Narwhals" (and which I've been very careful to never refer to as a "dataframe standard") 😄

lostmygithubaccount commented 1 week ago

I'm back to confused -- the origin of this issue is:

Currently all the backends we support implement the full Narwhals API

and a claim that you can't implement the full Narwhals API with Ibis. there have been 2 examples I've seen, and in both cases you could implement the API with Ibis

and these comparisons are each automatically triggering .to_pyarrow().as_py(), then doesn't that become potentially extremely expensive?

it depends? I'm a little confused about the use case or how you could achieve this without computation, i.e. how would you know the mean of a column without computing it? for performance reasons, you wouldn't write those as Python if/else statements using Ibis (even though you can), you'd rather use ibis.ifelse and/or case statements and potentially UDFs depending on what those functions do. but the point is you can implement it with the Ibis API
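
For concreteness, a minimal sketch of keeping the branching inside the expression graph with ibis.ifelse (the threshold and column names are just illustrative):

import ibis

t = ibis.examples.penguins.fetch()

# Branch per-row inside the query instead of with Python if/else:
bill_class = ibis.ifelse(t["bill_length_mm"] > 40, "long", "short")
result = t.mutate(bill_class=bill_class).execute()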

Side note: automatic implicit computation was one of the major disagreements I had with some of your colleagues during the dataframe consortium project, which spurred me to take several ideas they'd rejected (expressions, separating lazy vs eager apis, to_dict, iterating over groups, an explicit collect, to_numpy, and much more) and spin them out into a separate project which I ended up calling "Narwhals" (and which I've been very careful to never refer to as a "dataframe standard") 😄

ok. I have never been involved in the dataframe consortium project; my point is that the examples given as justification for putting Ibis in a different class of support are all implementable with the Ibis API. I don't think I care about automatic implicit computation personally and I'm just commenting as an Ibis contributor

MarcoGorelli commented 1 week ago

how would you know the mean of a column without computing it?

I wouldn't, which is why in that class from scikit-lego they use nw.from_native(..., eager_only=True)

If your table represents a DAG with some very expensive operations, then repeatedly triggering them implicitly risks getting really expensive. Whereas if you're working with an eager dataframe, then there's no risk of repeatedly triggering an expensive read_parquet

in both cases you could implement the API with Ibis

In both cases we would end up doing something expensive under the hood (join in the first case, and collection + conversion to pyarrow in the second)

I don't really want such performance footguns in Narwhals itself - having said that, it is possible to extend Narwhals by implementing __narwhals_dataframe__ / __narwhals_lazyframe__; if you'd be willing to try this out, I'd be really curious to see how it would work

it's tightly coupled to a single backend and has a ton of performance overhead

Right, that's exactly what I feared 😄

lostmygithubaccount commented 1 week ago

I think more assumptions are being made about performance -- I'm not sure why you'd repeatedly trigger a DAG with some very expensive operations. there are numerous ways to cache computations or mimic eager execution fully if you really want to with the Ibis API

I'm not sure what performance you're optimizing for, but my understanding is out-of-core larger-than-RAM datasets will necessitate something like how Ibis with the DuckDB backend works, and I'm pretty sure you'd get performance on-par or better for smaller workloads too

anyway sorry for hijacking this issue, you can obviously do whatever you want. my overall point is I'm reasonably confident the Narwhals API is fully implementable in a performant manner with Ibis. I do not have the time to work on this but am a little concerned about the repeated assumptions being made; feel free to reach out within the Ibis community if you or anyone else are working on this in the future and aren't sure how to accomplish it with Ibis. it can probably be done, and it can probably be done performantly

MarcoGorelli commented 1 week ago

Thanks for your inputs, really appreciate it 🙏

it can probably be done, and it can probably be done performantly

Sure, then please allow me to rephrase: I'm not saying it can't be done, I'm saying I can't do it. It sounds intractably complicated to keep track of things and figure out when to do caching, and I don't have that level of skill. And if the pandas API implementation on top of Ibis you mentioned has a tonne of performance overhead, it suggests I'm not alone

For now, I'll aim for partial API support - that, at least, I think I have the skills to manage. If things get more traction, we can be more ambitious

lostmygithubaccount commented 1 week ago

if the pandas API implementation on top of Ibis you mentioned has a tonne of performance overhead, it suggests I'm not alone

I don't think this is the correct takeaway from my earlier statement :) the full breadth of the pandas API, including all of the inherently inefficient operations that the Polars and Ibis APIs omit, is a lot different from the subset of the Polars API Narwhals implements

MarcoGorelli commented 1 week ago

It's the "automatically figure out when to trigger compute / cache in a Python API" that seems intractably complex to me, not the pandas-specific parts like row labels

lostmygithubaccount commented 1 week ago

It's the "automatically figure out when to trigger compute / cache in a Python API" that seems intractably complex to me

why would you need to do this? I'm still not understanding. in eager mode, you'll always immediately execute, in lazy mode you'll only do it on explicit API calls, right? it would be great to have a high-level overview of concerns or a concrete example of something in the Narwhals API you couldn't implement performantly with the Ibis API

MarcoGorelli commented 1 week ago

Sure, please take a look at the TimeGapSplit from scikit-lego: https://github.com/koaning/scikit-lego/blob/1d9ef4c90e0ac61c0140e8f61be1b897e8afffeb/sklego/model_selection.py#L17-L272

Perhaps the easiest thing would be if you put together a POC of how you'd wrap Ibis in the Polars API such that it can be used in that class

lostmygithubaccount commented 1 week ago

that is code using the Narwhals API? are there any specific Narwhals operations you're concerned can't be implemented? it looks like they could all be implemented to me. if you can implement the full Narwhals API on Ibis (the subset of the Polars API), won't scikit-lego's code just work?

MarcoGorelli commented 1 week ago

There's no operation which, in isolation, I'm concerned about

But taken together, the fact that the solution you proposed requires converting to pyarrow means that, if I want to avoid repeatedly converting the same dataframe to pyarrow, you're going to have to do some caching

And automatically figuring out when to do caching on behalf of the user strikes me as intractably complex - or at least, something that's beyond my skill-level. But, if anyone could show at least a POC of how it can be done, I'd gladly incorporate it into Narwhals
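
For concreteness, here is a rough sketch of the kind of memoisation being discussed (a hypothetical helper, not an existing Narwhals or Ibis mechanism); deciding when such a cache is safe or worthwhile is the hard part:

# Hypothetical sketch: memoise the PyArrow conversion of a lazy table so that
# repeated scalar checks don't re-trigger the full query.
class CachedTable:
    def __init__(self, native_table) -> None:
        self._native_table = native_table
        self._materialised = None  # cached pyarrow.Table, if any

    def _collect(self):
        if self._materialised is None:
            # e.g. an Ibis table's .to_pyarrow(), executed at most once
            self._materialised = self._native_table.to_pyarrow()
        return self._materialised

    def with_expr(self, new_native_table):
        # Any new operation invalidates the cache by returning a fresh wrapper.
        return CachedTable(new_native_table)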

lostmygithubaccount commented 1 week ago

how would you know the mean of a column without computing it?

I wouldn't, which is why in that class from scikit-lego they use nw.from_native(..., eager_only=True)

If your table represents a DAG with some very expensive operations, then repeatedly triggering them implicitly risks getting really expensive. Whereas if you're working with an eager dataframe, then there's no risk of repeatedly triggering an expensive read_parquet

And automatically figuring out when to do caching on behalf of the user strikes me as intractably complex

I'm not convinced this is something you need to do with how the Narwhals API works, but it's probable I'm missing something -- as I understand it, you're opting into eager or lazy evaluation of expressions. you can easily simulate eager behavior with Ibis if you want, no need for complicated caching mechanics, though of course you're limited to result sets that fit in memory (as with any other eager API)

my overarching concern here is misrepresentative statements (or takeaways from my statements) about Ibis predicated on misunderstandings: we've gone from "you can't do this with the Ibis API" -> "it's not performant with the Ibis API" -> "it's too complicated to manage", with little effort put into the investigations of each claim. if performance footguns are a key concern for Narwhals, I think some effort in understanding the options with Ibis and the tradeoffs should be considered before implementing it as a third-tier backend and claiming that's how it had to be.

anyway sorry for taking so much time, again you're free to do whatever you want (and to be clear I'm just one person, my views are not necessarily representative of the opinions of any open source projects I work on or companies I'm employed by :) ) -- if anyone in the Narwhals community is interested in implementing the full API for an Ibis backend that's performant, I'm sure the Ibis team is happy to help with any questions!

MarcoGorelli commented 1 week ago

with little effort put into the investigations of each claim

right, that's enough


update: Narwhals 1.1.0 adds "interchange"-level support, which includes Ibis 🤝 🤗

Increased levels of support may come in future releases

MarcoGorelli commented 1 week ago

Alright, now that we've all slept on this, let's see if we can move forwards amicably 🤗

What rubbed me the wrong way was that I'd put a fair bit of effort into adding "interchange"-level support, which includes Ibis.

Even if this isn't full-API-level support, it's enough to help fix a bug which Ibis users are currently facing when using Altair: https://github.com/vega/altair/issues/3473

Narwhals can already make things better for Ibis users, even if we don't achieve full API level support


I should clarify my concerns around performance. My objective with Narwhals is for it to be a (close-to-)zero-cost abstraction around libraries. If I were to implement Series.mean for the Ibis backend by just doing .to_pyarrow().as_py() each time, then "Ibis via Narwhals" would likely be slower than "just run Ibis directly", and that wouldn't be acceptable to me
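
A rough sketch of that naive approach, purely to show where the cost would land (the IbisSeries name is hypothetical, not an actual Narwhals class):

# Hypothetical sketch: every Series.mean call on such a backend would execute
# the query and convert via PyArrow, i.e. one backend round-trip per call.
class IbisSeries:
    def __init__(self, column_expr) -> None:
        self._expr = column_expr

    def mean(self):
        # "Ibis via Narwhals" would pay this cost on every scalar reduction.
        return self._expr.mean().to_pyarrow().as_py()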

That's why I suggested starting with partial API support and considering going further later if things get more traction - that, at least, I'm reasonably confident I can implement as a zero-cost abstraction.

To then be told that, no, that's not good enough, I'm misrepresenting things, and if I don't think full API support is realistic, it's because I've put little effort in to understand Ibis - that comes across as unfriendly. Doubly-so after having put in the effort to fix a bug for Ibis users (though, admittedly, you might not have been aware of this last part)

If anyone from Ibis wants to claim that it's possible to implement the full Narwhals API on Ibis, as a zero-cost abstraction, then that's fine - but I'd like to invite you to provide the implementation, or at least a POC. You said you "might be missing something", and indeed, I can assure you that if you try to complete what I suggested, you'll realise what it is

I've removed the Ibis comparison on the README, as I understand you didn't appreciate it - I'm sorry for having posted it. Conversely, I'd also like to ask that folks from Ibis stop asking me "why don't you just use Ibis"

I hope we'll be on good terms going forwards 🤝


One more addendum: if anyone wants to talk anything over, my calendly is here

lostmygithubaccount commented 1 week ago

it's clearly not worth continuing this discussion but as far as I'm concerned things didn't get too heated and we're on perfectly professional terms, happy to collaborate on whatever going forward. again, this is your project and you can do whatever you want. but please don't ascribe the actions/intent of others to me, as you've done for a second time here -- my intent on this issue was to convince you or whoever that you can use Ibis as a performant backend for the Narwhals API. I have tried my best but you clearly disagree, which is fine (and to your point the proof would be an implementation). however, from my perspective, I have been met with shifting goalposts, false assumptions about my intent or what I think, mischaracterization of my statements, and now an ask to specifically tell people I might know not to ask you a question you don't like answering

FWIW thanks for fixing a bug Ibis users run into; I fix bugs for Polars users too and would for Narwhals if anything came up that I could handle (though, admittedly, you might not have been aware, and I'm not sure why this is something to point out in the context of this discussion)