MarcoGorelli opened 1 month ago
I'm a fan of Ibis and use it regularly, mainly together with https://github.com/binste/dbt-ibis, and I find it exciting to see that you're looking at it as a potential layer around DuckDB.
However, given that one of the benefits of Narwhals is low overhead, I just wanted to note that all the type validation and abstraction in Ibis does add overhead, which can add up. It might not be noticeable when working interactively with larger amounts of data, but I've had use cases in web applications where I had to rewrite code from Ibis back to pure DuckDB or pandas because of this.
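The kind of per-call cost being described here can be illustrated with a stdlib-only toy (the validation layer below is purely hypothetical and not Ibis's actual code): each extra layer of pure-Python indirection and type checking adds a small fixed cost per call, which is invisible for one big query but adds up across many small calls.

```python
import timeit

def add_one(x):
    # the "real" work
    return x + 1

def validated_add_one(x):
    # toy stand-in for the kind of per-call type validation an
    # expression library might perform (purely illustrative)
    if not isinstance(x, int):
        raise TypeError("expected an int")
    return add_one(x)

def api_add_one(x):
    # one more layer of indirection, as a public API wrapper might add
    return validated_add_one(x)

n = 100_000
direct = timeit.timeit(lambda: add_one(1), number=n)
layered = timeit.timeit(lambda: api_add_one(1), number=n)
print(f"direct:  {direct * 1e9 / n:6.1f} ns/call")
print(f"layered: {layered * 1e9 / n:6.1f} ns/call")
```

Both functions compute the same result; only the per-call overhead differs.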
Here's a quick example, loosely based on this DuckDB-Ibis example, using Ibis 9.1 and DuckDB 1.0.
```python
import ibis

# Create an example database
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)
```
```python
%%timeit
# reconnect to the persisted database (dropping temp tables)
con = ibis.connect("duckdb://penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))
    .mutate(bill_length_cm=penguins.bill_length_mm / 10)
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.to_pyarrow()
```
19.1 ms ± 482 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Same in pure DuckDB (this also needs `import duckdb` and `from duckdb import ColumnExpression, ConstantExpression`, run beforehand):
```python
%%timeit
con = duckdb.connect("penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter(
        (ColumnExpression("species") == ConstantExpression("Gentoo"))
        & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
    )
    .project(
        *[ColumnExpression(name) for name in penguins.columns],
        (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
            "bill_length_cm"
        ),
    )
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.arrow()
```
1.5 ms ± 9.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see https://github.com/ibis-project/ibis/issues/9111, but some overhead might just be inherent.
Just thought you might appreciate the note that it's worth benchmarking what you want to achieve, in case it wasn't already on your list :)
Hope you don't mind me commenting here, this issue was linked in our issue tracker.
Just want to offer a small clarification on:
> I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see https://github.com/ibis-project/ibis/issues/9111, but some overhead might just be inherent.
Ibis's expressions definitely have some overhead, but that overhead scales with the size of your query, not your data. We're doing some work to reduce it in certain cases (as you noted, some operations scaled poorly on wide tables), but for most analytics use cases the overhead (10-100 ms in your examples above) is negligible compared to the execution time of the query. Ibis was designed to perform well for data analytics/engineering tasks, with operations on medium-to-large data. For small data the overhead will be more noticeable. It definitely wasn't written to compete with SQLAlchemy-like webapp use cases!
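The "scales with query size, not data size" point can be sketched with a toy expression tree (illustrative only, not Ibis's implementation): building the expression just records operations, so its cost grows with the number of recorded steps and is the same whether the underlying table has ten rows or ten million.

```python
# toy "expression" that only records operations; building it never touches
# data, so construction cost depends on the number of operations (the query
# size), not on how many rows the table eventually has
class Expr:
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def filter(self, predicate):
        # record the operation; nothing is executed yet
        return Expr("filter", self, predicate)

def depth(e):
    # number of nested operations in the expression tree
    if not isinstance(e, Expr):
        return 0
    return 1 + max((depth(c) for c in e.children), default=0)

q = Expr("table")
for _ in range(100):
    q = q.filter("some-predicate")  # 100 recorded steps, zero data scanned
print(depth(q))  # grows with the number of operations only
```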
Thanks @jcrist! That summarises better what I was trying to convey and hopefully helps the Narwhals maintainers think this one through :)
Thanks both for comments! 🙏
The objective of Narwhals is to be as close as possible to a zero-cost abstraction, whilst keeping everything pure-Python and without required dependencies.
At a minimum, if we were to consider supporting DuckDB by going via Ibis, we would require that this be done with only lightweight dependencies and with negligible overhead. I understand that we're not yet in a situation where this is possible, but I was keen to get the conversation started.
Anyway, it's nice to see this issue bringing together maintainers of different projects to discuss things 🤗
`dask.dataframe`
Things are picking up here. It's involving a fair few refactors, which is probably a good thing. If anyone's interested in contributing support for the other libraries mentioned, I'd suggest waiting until Dask support is complete enough (let's say, able to execute TPC-H queries 1-7) - by the time that's happened, adding other backends should be much simpler.
If we look at:

then it doesn't seem too hard to support the `narwhals.LazyFrame` API without running into footguns. We may need to disallow operations which operate on multiple columns' row orders independently (the alternative is doing a join on row order behind the scenes, which I'm not too keen on, as we'd be making an expensive operation look cheap).

Each one can then have an associated eager class, which is what things get transformed to if you call `.collect`:

We can start with this, and possibly consider expanding support later. As they say - "in open source, 'no' is temporary, but 'yes' is forever" - doubly so in a library like Narwhals with a stable API policy 😄
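The lazy-class-with-associated-eager-class handoff being described could be sketched as follows (a toy illustration under my own naming, not Narwhals' actual classes): the lazy frame records operations, and `.collect` executes the recorded plan and returns the eager counterpart.

```python
# toy sketch: a lazy frame records operations and only materializes them
# into an eager frame when .collect() is called (illustrative only)
class EagerFrame:
    def __init__(self, rows):
        self.rows = rows

class LazyFrame:
    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = ops  # deferred operations, applied only on collect

    def filter(self, predicate):
        # record the operation; nothing is executed yet
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def collect(self):
        # execute the recorded plan and hand back an eager frame
        rows = self._rows
        for op, arg in self._ops:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
        return EagerFrame(rows)

lf = LazyFrame([{"a": 1}, {"a": 5}]).filter(lambda r: r["a"] > 2)
print(lf.collect().rows)  # [{'a': 5}]
```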
Furthermore, we'd like to make it easier to extend Narwhals, so that if library authors disagree with us and want to add complexity which we wouldn't be happy with maintaining, they're free to extend Narwhals themselves - all they'd need to do is implement a few dunder methods.
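As a rough sketch of what such a dunder-based hook might look like on the library-author side (the method name `__narwhals_dataframe__` is an assumption on my part; check the Narwhals documentation for the exact protocol):

```python
# hypothetical sketch of a third-party frame exposing a dunder hook that a
# compatibility layer could look for; __narwhals_dataframe__ is an assumed
# name, not confirmed Narwhals API
class MyCustomFrame:
    def __init__(self, columns):
        self._columns = columns  # {name: list_of_values}

    def __narwhals_dataframe__(self):
        # return the object implementing the compliant dataframe API;
        # here we simply return self
        return self

    @property
    def columns(self):
        return list(self._columns)

df = MyCustomFrame({"a": [1, 2, 3]})
print(df.__narwhals_dataframe__().columns)  # ['a']
```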