ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(duckdb): restore or add new method for consuming `pyarrow.dataset` #10082

Open gforsyth opened 1 week ago

gforsyth commented 1 week ago

In #9666 we deprecated the `read_in_memory` method on the DuckDB backend in favor of creating a `memtable` or passing an object directly to `create_table`.

This does leave out `pyarrow.dataset.Dataset` objects, which we previously supported via `read_in_memory`. They have a `to_table()` method, but materializing them may not be ideal, especially since they can point to (lists of) remote resources, and DuckDB can (I believe) push filters down into the dataset reads to do column pruning and filtering.

We could add a `memtable` wrapper for them, although the deferred aspect is a bit different from other memtable-able objects.
Whatever we decide on might also apply to pyarrow record-batch readers.

acuitymd-filip commented 1 week ago

We're running into some issues with the intended replacement for the previous `read_in_memory` functionality in 9.5.0. Here's a quick reproduction of the issue:

import ibis
import ibis.backends.duckdb
import pyarrow as pa

def attach_tables(backend, table):
    return backend.create_view("test_table", ibis.memtable(table))

backend = ibis.duckdb.connect()
n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
table = pa.Table.from_arrays([n_legs, animals], names=names)

ibis_view = attach_tables(backend, table)
ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0

The above works in 9.4.0 but breaks in 9.5.0 with the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/filip/code/ibis-memtable-reproduction/src/ibis_memtable_reproduction/__main__.py", line 19, in <module>
    ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0
    ^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/expr/types/core.py", line 396, in execute
    return self._find_backend(use_default=True).execute(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1444, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1377, in _to_duckdb_relation
    return self.con.sql(sql)
           ^^^^^^^^^^^^^^^^^
duckdb.duckdb.CatalogException: Catalog Error: Table with name ibis_pyarrow_memtable_r64pwlzyvfhflg2rexyz6hgutq does not exist!
Did you mean "sqlite_temp_master"?

It makes sense why this breaks, given the changes in this (and related) PR: https://github.com/ibis-project/ibis/pull/10055

This does, however, effectively introduce a breaking change: previously, the lifecycle of data registered as views in DuckDB was tied to the lifecycle of the DuckDB connection, but now it has to be managed manually, because memtable references get GC'd once they go out of scope, regardless of whether they have been registered as views with the DuckDB backend.

I can work around this by holding on to references to the memtable objects explicitly, but I wonder if there's perhaps something that could be done in the `create_view` method that would allow Ibis to associate a memtable instance with the backend instance that's using it?
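To see the lifecycle change in isolation, here is a simplified, hypothetical model in plain Python (no Ibis): a backend whose registered view is dropped when the Python object backing it is garbage-collected, which is effectively the 9.5.0 behavior described above. All names here are illustrative, not the real internals.

```python
import gc
import weakref


class ToyBackend:
    """Hypothetical stand-in for a backend's view registry."""

    def __init__(self):
        self.views = {}

    def register(self, name, data):
        self.views[name] = f"view over {type(data).__name__}"
        # Tie view cleanup to the lifetime of the backing Python object,
        # analogous to how memtable finalization behaves in 9.5.0.
        weakref.finalize(data, self.views.pop, name, None)


class Payload:
    """Stands in for a memtable's backing data."""


backend = ToyBackend()


def attach(backend):
    data = Payload()
    backend.register("test_table", data)
    # `data` goes out of scope when this function returns; nothing else
    # holds a reference to it.


attach(backend)
gc.collect()
print("test_table" in backend.views)  # the view vanished with the object

# Holding a reference keeps the view alive -- the manual workaround above.
kept = Payload()
backend.register("kept_table", kept)
gc.collect()
print("kept_table" in backend.views)
```

This also shows why `create_table` is unaffected: copying the data into DuckDB breaks the dependence on the Python object's lifetime entirely.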

gforsyth commented 1 week ago

Thanks for that extra context @acuitymd-filip, and sorry about the inadvertent breaking change. We really try to avoid those (without major version bumps), but things do slip through.

I can think of a relatively straightforward way to handle passing in-memory objects directly to `create_view` (without wrapping them in a memtable), though it might be a bit kludgy. I'll take a look.

cpcloud commented 1 week ago

Why wouldn't passing the name argument to memtable work? That's no different from creating a view of a memtable.

gforsyth commented 6 days ago

Why wouldn't passing the name argument to memtable work? That's no different from creating a view of a memtable.

The memtable still gets GC'd (if defined as in the example above) -- this is a non-issue for `create_table` because the data are copied.

One thing we could do: if something that isn't an `ir.Expr` is passed to `create_view`, we can register it via the (private) `_read_in_memory` dispatcher -- that would create the view of the object without the GC behavior cleaning up the underlying object (since it's no longer a memtable).

It would be tricky, but probably doable, to handle `overwrite=True` in that context -- I don't know how we can respect custom `catalog.database` locations for registration, though.
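A hypothetical sketch of that dispatch, assuming a `_read_in_memory`-style registry keyed on input type; the class and function names here are illustrative stand-ins, not the real Ibis internals:

```python
from functools import singledispatch


class Expr:
    """Stand-in for ibis's ir.Expr."""


@singledispatch
def _read_in_memory(obj, name, registry):
    raise TypeError(f"don't know how to register a {type(obj).__name__}")


@_read_in_memory.register(dict)
def _(obj, name, registry):
    # Register the raw object directly: no memtable wrapper, so no
    # GC-driven cleanup of the underlying data.
    registry[name] = obj


def create_view(name, obj, registry, database=None):
    if isinstance(obj, Expr):
        registry[name] = "view defined by compiled SQL"  # the normal path
    else:
        if database is not None:
            # Direct registration always lands in the temp catalog, so a
            # custom catalog.database location can't be honored here.
            raise ValueError("database= is not supported for in-memory objects")
        _read_in_memory(obj, name, registry)


views = {}
create_view("test_table", {"n_legs": [2, 4, 5, 100]}, views)
```

The `database=None` guard mirrors the later suggestion in this thread to raise when a non-`ir.Expr` input is combined with an explicit `database=`.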

gforsyth commented 1 day ago

I did some poking at this.

The `register` method for creating a view from an existing in-memory Python object always creates a view in the temp catalog -- it's always temporary, so it always goes in the temp catalog. So I think we just raise if someone passes in a non-`ir.Expr` and tries to specify `database=`. If they need something persistent, that should go through `create_table` anyway, which does handle this -- and if they really need a view in a particular catalog, they can create a view of the view (although this won't survive beyond the current session, because the intermediate temporary view from `register` will get dropped).

cpcloud commented 1 day ago

I'm not sure why naming the memtable and not using create_view doesn't solve the scoping problem here.

The issue seems to be creating a view from an inline-constructed memtable. If you don't do that and instead name the memtable with the name you'd give to the view, then I don't think there's an issue, because those memtables now must be in scope to use them.

gforsyth commented 1 day ago

That seems viable; it just requires different work (which is fine). We don't support `Dataset`s as inputs to `memtable`, and there are some semantics we'd have to sort out around consuming/materializing the (potentially remote) data inside the dataset.

I think, especially, we need to ensure that DuckDB can still push down filters and projections.