Open OlivierBinette opened 5 months ago
Hi @OlivierBinette 👋🏻!
Thanks for opening an issue about this.
I will write some things down here that can hopefully be used by myself or others to seed a documentation PR.
When and how a lazy Ibis table becomes a view or a table in your backend database.
As for "how", there are generally two methods for doing that:
newly_created_table = con.create_table(name, table_expr)
newly_created_view = con.create_view(name, table_expr)
Both of newly_created_table
and newly_created_view
are Ibis table expressions backed by some named physical entity in the database. At the expression level Ibis doesn't distinguish between a view and a table.
As for "when", that's generally when you invoke those methods.
How Ibis Tables get_name() function returns a value and how this name relates to the physical table name.
This name corresponds the physical table's name only in the case that the table is backed by a physical entity in the database like a table or view. Otherwise it's an ephemeral name.
I would try to avoid depending on the return value of this method being anything in particular for an arbitrary table expression.
While get_name()
is a public API, it's not entirely clear to me whether it's useful. That said, if we were to remove it it would go through a deprecation cycle (1 or 2 minor/patch releases) and only be explicitly removed in a major release.
How you can list materialized tables
con.list_tables()
returns a list of strings, one string for each table or view.
or distinguish between unevaluated Ibis table expressions and database tables.
All Ibis table expressions are unevaluated until you call to_pandas
/to_pyarrow
/to_pyarrow_batches
.
If you're asking how to distinguish between whether something is a leaf in the expression tree or not, there's no public API for doing this right now.
Can you describe what you want to do with this information and/or more about your use case? In practice this hasn't come up IIRC.
How you can go from an Ibis Table to workinig directly with the underlying table reference (either through SQLAlchemy or your database' official Python client).
We don't have a generic API for this, partly because each backend's target representation is different. Some backends work directly with strings and don't have anything analogous to sqlalchemy.Table
(for example).
Can you describe what you're trying to do here?
Thanks for your answers!
Follow-up question:
Does con.list_tables()
return the name of unmaterialized tables in some cases, or of tables registered in a different schema or database? I'm working with DuckDB in particular, and if I'm not mistaken some of the tables listed in listed_tables might be in the temp rather than main. I'm not 100% sure though as I think there has been some changes with list_tables() recently.
Right now, I use a direct connection to my database (using DuckDB's Python client) to query my database and find whether or not a table has been materialized:
def material_tables_list(self):
material_tables_data = self.duckdb_connection.sql("show tables").fetchall()
columns_data = list(zip(*material_tables_data))
if len(columns_data) > 0:
return columns_data[0]
else:
return []
For your final question question:
The pipeline that I use after pre-processing data with Ibis expects the name of a table in my database or a DuckDBPyRelation object. It could possibly accomodate a sqlalchemy.Table class as well. This requires making sure that the input Ibis Table has been materialized and then fetching its physical name.
For more information, here's what I do:
get_name()
function to pass the table name to the secondary pipeline. The output of the pipeline is a new table in my database, and I return an Ibis Table reference to it.Happy to share code if this is too abstract. I could put together a minimal example.
@cpcloud Big picture, I think the key thing is that I am using Ibis Table objects not just for data manipulation, but also as my abstract table interface in my system.
As such, I need to not only be able to create Ibis Tables, but I also need to be able to go from an Ibis Table back to my backend database. The common interface for this might be to materialize a table in a database/schema/namespace of your choice (or even just the default one), and then get the name of the physical table as a string. I think this is mostly supported by Ibis, but the documentation might need to be clarified a bit and that use case could be highlighted.
I think the use @OlivierBinette describes here is nearly the exact same use case that I had in https://github.com/ibis-project/ibis/issues/6636.
I also started #7762 if anyone wants to chime in there.
I think better docs would logically get added to https://ibis-project.org/how-to/input-output/basics#data-platforms. @OlivierBinette what parts of that guide are lacking that we should add?
@NickCrews Maybe the data-platforms page could include a few points about the following:
list_tables()
to fetch table names, making sure to highlight that some of the tables listed there may be in different database schemas (e.g., main and temp for DuckDB), and explaining how exactly all of that is managed.I don't mind whether or not Ibis' Table API directly implements functions to do these things, but I definitely want to have enough information so that I can implement reliable functions that do these things for a specific backend I might be using.
@OlivierBinette I'm having a bit of trouble understanding what you mean by "materialize".
Does that mean creating a table in the database from an expression, or executing an expression (getting back in-memory results), or something else?
@cpcloud By "materialize", I mean creating a persistent table in the database from an expression. I.e., the Ibis Table object should point a table in the backend database. Specifically:
Maybe "persisted" would be a better word here.
Here's how I currently check if a table has been "materialized", using a separate direct connection self.duckdb_connection
to my DuckDB database:
def material_tables_list(self):
material_tables = self.duckdb_connection.sql("show tables").fetchall()
columns_data = list(zip(*material_tables))
if len(columns_data) > 0:
return columns_data[0]
else:
return []
And here's how I create a "material view" for a given table:
def material_view(self, table: Table, table_name: Optional[str] = None, overwrite=False):
if table_name is None:
table_name = table.get_name() # This raises an error if the Ibis table refers to an unnamed expression. A uuid should be created in that case.
if overwrite or table_name not in self.material_tables_list():
return self.db.create_table(table_name, table, overwrite=True)
else:
return self.db.tables[table_name]
Hm,
"material view"
is a confusing way to refer to "disk-backed table", especially since many platforms these days have an object called a "materialized view", which is neither a table nor is it a traditional view 😅
I think we need some documentation that gives some conceptual foundations about what words like "view" and "table" mean. They're independent of ibis, but I think we should all try to get on the same page with respect to database nouns to the extent possible.
For example, both tables and views have the ability to be temporary and overwritten, but only tables have data that takes up storage space.
To clarify a couple things:
This table should have a fixed name/location in the database that can be referenced by other, separate connections to the database.
This is the notion of a non-temporary tabular object in a database, i.e., a thing created with either CREATE VIEW
or CREATE TABLE
, no TEMPORARY
.
The table should be persisted to the database, i.e. not be dropped once Ibis' connection to the database is closed.
This is the same notion as the first bullet point. Unless a view or a table was created with temp=True
(which maps to the TEMPORARY
keyword in SQL), the object will persist beyond the Ibis session.
@cpcloud Thanks for the clarifications! I think what confuses things more is that an Ibis "table" can be a database table, a database view, an unevaluated expression, etc. So ultimately the documentation issue is to clarify the relationship between the Ibis and backend concepts, and how to figure out how one maps to the other.
For my application, I often want to make sure that an Ibis table refers to a non-temporary database table, or to make an Ibis table to a non-temporary database table if that's not already the case.
For my application, I often want to make sure that an Ibis table refers to a non-temporary database table
I would say that there's no need verify this if you're using create_table
. Its result will always be a persistent named table taking up space on disk.
or to make an Ibis table to a non-temporary database table if that's not already the case.
I think create_table
is the answer here as well:
t = con.create_table(name, any_table_expression) # the second argument can be a view or a table
The return value of create_table
called this way will always be a table that persists beyond the ibis session.
@cpcloud
I would say that there's no need verify this if you're using create_table. Its result will always be a persistent named table taking up space on disk.
I have functions that take as input Ibis Table objects, and unfortunately I don't know how the user created them. I'm using Ibis as a high-level Table interface abstraction for a piece of software, but I'm not responsible for supplying the input.
Should I do try/catch on con.create_table
? This function throws an error if the table already exists, and I can overwrite otherwise. However, the table could already exist and be (for DuckDB) in the "temp" database schema instead of the "main" schema.
There's also a cache()
function for tables, but I'm not sure what to expect of it regarding the backend.
Overall, I need a better understanding of how I should inspect Ibis Table objects against their backend.
I have functions that take as input Ibis Table objects, and I don't know how the user created them.
Ah, okay, that helps me understand a bit more about what you're doing.
Should I do try/catch on con.create_table? This function throws an error if the table already exists, and I can overwrite otherwise.
Interesting, we actually had a force
argument that would generate a CREATE TABLE t IF NOT EXISTS
statement which sounds like potentially what you're looking for. The semantics are: create the table as usual if it doesn't exist otherwise return the existing table.
Perhaps we should consider bringing that back, with a better name of course 😄
There's also a cache() function for tables, but I'm not sure what to expect of it regarding the backend.
cache
is explicitly designed to be session-bound: any cached tables are temporary and dropped at the end of the session.
@OlivierBinette Jumping back in here a bit.
Can you elaborate on why you need to know whether the thing underlying the Ibis table expression exists on disk as a non-temporary table?
Generally speaking if you're providing an API to end users that accepts arbitrary Ibis table expressions your API shouldn't need to know details like whether the table exists on disk. Those expressions could be a physical table, or a 400 element UNION ALL
monstrosity.
I'd like to understand the architectural details of your application so that we can get a better idea of whether we need to bring back the CREATE TABLE IF NOT EXISTS
functionality or not.
@cpcloud Yes for sure. I'm building a Python package that will be used by data scientists to define data processing and entity resolution pipelines. The package has pre-built functions that take Ibis Table as input, and return Ibis Tables.
Some of these pre-built functions rely on external tools for more complex data processing, including Splink. To hand off a table to Splink, I need to give Splink the name of the table in the database. I also need to make sure that the table is non-temporary, as Splink relies on its own, separate connection to the database.
So basically, I want to have interoperability between Ibis and other systems that work with the same backend database.
@cpcloud Ideally, I'd be able to achieve the following two things:
I am in a very similar boat with needing to pass off tables back and forth with Splink.
@OlivierBinette Are you sure that you actually need to be able to distinguish, or do you really only care that the result fits your needs? ie if ibis had this function and you didn't know the internals, would that work?
def ensure_named(t: Table, name: str | None = None, temp: bool = True, view: bool = True) -> Table:
"""
Create a temp/permanent view/table in the database if the table isn't already that state.
"""
After this method, it is guaranteed that t.get_name()
will give me a name that splink will accept.
@NickCrews Yes that would cover my use case. I currently have a function that roughly does this for my DuckDB backend, but it's a bit janky.
The create_table
and create_view
functions could also be changed to have a force
argument as cpcloud was saying, although we'd want to make sure that these functions always return Ibis Tables backed by non-temporary tables.
create_table
and create_view
always create objects that persist, unless you pass temp=True
to them.
Please describe the issue
Context:
I'm using Ibis to do data pre-processing, and then using the name of the materialized table in the database as the input to another pipeline.
Problem:
After spending some time working with Ibis and reading through the docs, I'm still unclear about the relationship between Ibis tables, the tables listed in
con.tables
, and the views and tables in my database.Specifically, I think the documentation should clarify:
get_name()
function returns a value and how this name relates to the physical table name.Code of Conduct