MarcoGorelli opened 1 month ago
I'm a fan of Ibis and use it regularly, mainly together with https://github.com/binste/dbt-ibis, and I find it exciting to see that you're looking at it as a potential layer around DuckDB.
However, given that one of the benefits of Narwhals is low overhead, I just wanted to note that all the type validation and abstraction in Ibis does add overhead, which can add up. It might not be noticeable when working interactively with larger amounts of data, but I've had use cases in web applications where I had to rewrite code from Ibis back to pure DuckDB or pandas because of this.
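The kind of per-call cost being described here can be illustrated with a stdlib-only toy (the validation layer below is purely hypothetical and not Ibis's actual code): each extra layer of pure-Python indirection and type checking adds a small fixed cost per call, which is invisible for one big query but adds up across many small calls.

```python
import timeit

def add_one(x):
    # the "real" work
    return x + 1

def validated_add_one(x):
    # toy stand-in for the kind of per-call type validation an
    # expression library might perform (purely illustrative)
    if not isinstance(x, int):
        raise TypeError("expected an int")
    return add_one(x)

def api_add_one(x):
    # one more layer of indirection, as a public API wrapper might add
    return validated_add_one(x)

n = 100_000
direct = timeit.timeit(lambda: add_one(1), number=n)
layered = timeit.timeit(lambda: api_add_one(1), number=n)
print(f"direct:  {direct * 1e9 / n:6.1f} ns/call")
print(f"layered: {layered * 1e9 / n:6.1f} ns/call")
```

Both functions compute the same result; only the per-call overhead differs.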
Here's a quick example, loosely based on this DuckDB-Ibis example, using Ibis 9.1 and DuckDB 1.0.
```python
import ibis

# Create an example database
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)
```
```python
%%timeit
# reconnect to the persisted database (dropping temp tables)
con = ibis.connect("duckdb://penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))
    .mutate(bill_length_cm=penguins.bill_length_mm / 10)
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.to_pyarrow()
```
19.1 ms ± 482 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Same in pure DuckDB (this also needs `import duckdb` and `from duckdb import ColumnExpression, ConstantExpression`, run beforehand):
```python
%%timeit
con = duckdb.connect("penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter(
        (ColumnExpression("species") == ConstantExpression("Gentoo"))
        & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
    )
    .project(
        *[ColumnExpression(name) for name in penguins.columns],
        (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
            "bill_length_cm"
        ),
    )
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.arrow()
```
1.5 ms ± 9.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see https://github.com/ibis-project/ibis/issues/9111, but some overhead might just be inherent.
Just thought you might appreciate the note that it's worth benchmarking what you want to achieve, in case it wasn't already on your list :)
Hope you don't mind me commenting here, this issue was linked in our issue tracker.
Just want to offer a small clarification on:
> I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see https://github.com/ibis-project/ibis/issues/9111, but some overhead might just be inherent.
Ibis's expressions definitely have some overhead, but that overhead scales with the size of your query, not your data. We're doing some work to reduce it in certain cases (as you noted, some operations scaled poorly on wide tables), but for most analytics use cases the overhead (10-100 ms in your examples above) is negligible compared to the execution time of the query. Ibis was designed to perform well for data analytics/engineering tasks, with operations on medium-to-large data. For small data the overhead will be more noticeable. It definitely wasn't written to compete with SQLAlchemy-like webapp use cases!
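The "scales with query size, not data size" point can be sketched with a toy expression tree (illustrative only, not Ibis's implementation): building the expression just records operations, so its cost grows with the number of recorded steps and is the same whether the underlying table has ten rows or ten million.

```python
# toy "expression" that only records operations; building it never touches
# data, so construction cost depends on the number of operations (the query
# size), not on how many rows the table eventually has
class Expr:
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def filter(self, predicate):
        # record the operation; nothing is executed yet
        return Expr("filter", self, predicate)

def depth(e):
    # number of nested operations in the expression tree
    if not isinstance(e, Expr):
        return 0
    return 1 + max((depth(c) for c in e.children), default=0)

q = Expr("table")
for _ in range(100):
    q = q.filter("some-predicate")  # 100 recorded steps, zero data scanned
print(depth(q))  # grows with the number of operations only
```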
Thanks @jcrist! That summarises better what I was trying to convey and hopefully helps the Narwhals maintainers think this one through :)
Thanks both for comments! 🙏
The objective of Narwhals is to be as close as possible to a zero-cost abstraction, whilst keeping everything pure-Python and without required dependencies.
At a minimum, if we were to consider supporting DuckDB by going via Ibis, we would require that this be done with only lightweight dependencies and with negligible overhead. I understand that we're not yet in a situation where this is possible, but I was keen to get the conversation started.
Anyway, it's nice to see this issue bringing together maintainers of different projects to discuss things 🤗
`dask.dataframe`
Things are picking up here. It's involving a fair few refactors, which is probably a good thing. If anyone's interested in contributing support for the other libraries mentioned, I'd suggest waiting until Dask support is complete enough (let's say, able to execute TPC-H queries 1-7) - by the time that's happened, adding other backends should be much simpler.
If we look at:

then it doesn't seem too hard to support the `narwhals.LazyFrame` API without running into footguns. We may need to disallow operations which operate on multiple columns' row orders independently (the alternative is doing a join on row order behind the scenes, which I'm not too keen on, as we'd be making an expensive operation look cheap).

Each one can then have an associated eager class, which is what things get transformed to if you call `.collect`:

We can start with this, and possibly consider expanding support later. As they say - "in open source, 'no' is temporary, but 'yes' is forever" - doubly so in a library like Narwhals with a stable API policy 😄
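The lazy-class-with-associated-eager-class handoff being described could be sketched as follows (a toy illustration under my own naming, not Narwhals' actual classes): the lazy frame records operations, and `.collect` executes the recorded plan and returns the eager counterpart.

```python
# toy sketch: a lazy frame records operations and only materializes them
# into an eager frame when .collect() is called (illustrative only)
class EagerFrame:
    def __init__(self, rows):
        self.rows = rows

class LazyFrame:
    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = ops  # deferred operations, applied only on collect

    def filter(self, predicate):
        # record the operation; nothing is executed yet
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def collect(self):
        # execute the recorded plan and hand back an eager frame
        rows = self._rows
        for op, arg in self._ops:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
        return EagerFrame(rows)

lf = LazyFrame([{"a": 1}, {"a": 5}]).filter(lambda r: r["a"] > 2)
print(lf.collect().rows)  # [{'a': 5}]
```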
Furthermore, we'd like to make it easier to extend Narwhals, so that if library authors disagree with us and want to add complexity which we wouldn't be happy with maintaining, they're free to extend Narwhals themselves - all they'd need to do is implement a few dunder methods.
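As a rough sketch of what such a dunder-based hook might look like on the library-author side (the method name `__narwhals_dataframe__` is an assumption on my part; check the Narwhals documentation for the exact protocol):

```python
# hypothetical sketch of a third-party frame exposing a dunder hook that a
# compatibility layer could look for; __narwhals_dataframe__ is an assumed
# name, not confirmed Narwhals API
class MyCustomFrame:
    def __init__(self, columns):
        self._columns = columns  # {name: list_of_values}

    def __narwhals_dataframe__(self):
        # return the object implementing the compliant dataframe API;
        # here we simply return self
        return self

    @property
    def columns(self):
        return list(self._columns)

df = MyCustomFrame({"a": [1, 2, 3]})
print(df.__narwhals_dataframe__().columns)  # ['a']
```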