gforsyth opened 1 week ago
We're running into some issues with the intended replacement for the previous `read_in_memory` functionality in 9.5.0. Here's a quick reproduction of the issue:
```python
import ibis
import ibis.backends.duckdb
import pyarrow as pa


def attach_tables(backend, table):
    return backend.create_view("test_table", ibis.memtable(table))


backend = ibis.duckdb.connect()

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
table = pa.Table.from_arrays([n_legs, animals], names=names)

ibis_view = attach_tables(backend, table)
ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0
```
The above works in 9.4.0 but breaks in 9.5.0 with the following error:
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/filip/code/ibis-memtable-reproduction/src/ibis_memtable_reproduction/__main__.py", line 19, in <module>
    ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0
    ^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/expr/types/core.py", line 396, in execute
    return self._find_backend(use_default=True).execute(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1444, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1377, in _to_duckdb_relation
    return self.con.sql(sql)
           ^^^^^^^^^^^^^^^^^
duckdb.duckdb.CatalogException: Catalog Error: Table with name ibis_pyarrow_memtable_r64pwlzyvfhflg2rexyz6hgutq does not exist!
Did you mean "sqlite_temp_master"?
```
It makes sense why this breaks, given the changes in this (and related) PR: https://github.com/ibis-project/ibis/pull/10055

This does, however, effectively introduce a breaking change: previously, the lifecycle of data registered as views in DuckDB was tied to the lifecycle of the DuckDB `con` instance, but now it has to be managed manually, because `memtable` references get GC'd once they go out of scope, regardless of whether they have been registered as views with the DuckDB backend.

I can work around this by explicitly holding on to the references to the `memtable` objects, but I wonder if there's something that could be done in the `create_view` method that would allow Ibis to associate a `memtable` instance with the instance of the backend that's using it?
Thanks for that extra context @acuitymd-filip, and sorry about the inadvertent breaking change. We really try to avoid those (without major version bumps), but things do slip through.

I can think of a relatively straightforward way to handle passing the in-memory objects directly to `create_view` (without wrapping in `memtable`), though it might be a bit kludgy. I'll take a look.
Why wouldn't passing the `name` argument to `memtable` work? That's no different from creating a view of a memtable.
> Why wouldn't passing the `name` argument to `memtable` work? That's no different from creating a view of a memtable.
The `memtable` still gets GC'd (if defined as it is in the example above) -- this is a non-issue in `create_table` because the data are copied.
One thing we could do: if something that is not an `ir.Expr` is passed to `create_view`, we could register it via the (private) `_read_in_memory` dispatcher -- that would create the view of the object without the GC behavior cleaning up the underlying object (since it's no longer a memtable).

It would be tricky, but probably doable, to handle `overwrite=True` in that context -- I don't know how we can respect custom `catalog.database` locations for registration, though.
I did some poking at this.
The `register` method for creating a view from an existing in-memory Python object always creates the view in the `temp` catalog. So I think we just raise if someone passes in a non-`ir.Expr` and tries to specify `database=`. If they need something persistent, that should go through `create_table` anyway, which does handle this -- and if they really need a view in a particular catalog, they can create a view of the view (although this won't survive longer than the existing session, because the intermediate temporary view from `register` will get dropped).
I'm not sure why naming the memtable and not using `create_view` doesn't solve the scoping problem here.

The issue seems to be creating a view from an inline-constructed memtable. If you don't do that, and instead name the memtable with the name you'd give to the view, then I don't think there's an issue, because those memtables must now be in scope to use them.
That seems viable, it just requires different work (which is fine). We don't support `dataset`s as inputs to memtable, and there are some semantics we'd have to sort out around consuming / materializing the (potentially remote) data inside the `dataset`.

I think, especially, we need to ensure that DuckDB can still push down filters and projections.
In #9666 we deprecated the `read_in_memory` method on the DuckDB backend in favor of creating a `memtable` or passing an object directly to `create_table`.

This does leave out `pyarrow.dataset.Dataset`s, which we previously supported via `read_in_memory`. They have a `to_table()` method, but materializing them may not be ideal, especially since they can point to (lists of) remote resources, and DuckDB can (I believe) push filters down into the dataset reads to do column pruning and filtering. We could add a memtable wrapper for them, although the deferred aspect is a bit different from other memtable-able objects.
Whatever we decide on might also apply to pyarrow record-batch-readers.