ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.36k stars 540 forks source link

docs: Unclear how to materialize a table or list materialized tables #7742

Open OlivierBinette opened 5 months ago

OlivierBinette commented 5 months ago

Please describe the issue

Context:

I'm using Ibis to do data pre-processing, and then using the name of the materialized table in the database as the input to another pipeline.

Problem:

After spending some time working with Ibis and reading through the docs, I'm still unclear about the relationship between Ibis tables, the tables listed in con.tables, and the views and tables in my database.

Specifically, I think the documentation should clarify:

Code of Conduct

cpcloud commented 5 months ago

Hi @OlivierBinette 👋🏻!

Thanks for opening an issue about this.

I will write some things down here that can hopefully be used by myself or others to seed a documentation PR.

When and how a lazy Ibis table becomes a view or a table in your backend database.

As for "how", there are generally two methods for doing that:

Both of newly_created_table and newly_created_view are Ibis table expressions backed by some named physical entity in the database. At the expression level Ibis doesn't distinguish between a view and a table.

As for "when", that's generally when you invoke those methods.

How Ibis Tables get_name() function returns a value and how this name relates to the physical table name.

This name corresponds the physical table's name only in the case that the table is backed by a physical entity in the database like a table or view. Otherwise it's an ephemeral name.

I would try to avoid depending on the return value of this method being anything in particular for an arbitrary table expression.

While get_name() is a public API, it's not entirely clear to me whether it's useful. That said, if we were to remove it it would go through a deprecation cycle (1 or 2 minor/patch releases) and only be explicitly removed in a major release.

How you can list materialized tables

con.list_tables() returns a list of strings, one string for each table or view.

or distinguish between unevaluated Ibis table expressions and database tables.

All Ibis table expressions are unevaluated until you call to_pandas/to_pyarrow/to_pyarrow_batches.

If you're asking how to distinguish between whether something is a leaf in the expression tree or not, there's no public API for doing this right now.

Can you describe what you want to do with this information and/or more about your use case? In practice this hasn't come up IIRC.

How you can go from an Ibis Table to workinig directly with the underlying table reference (either through SQLAlchemy or your database' official Python client).

We don't have a generic API for this, partly because each backend's target representation is different. Some backends work directly with strings and don't have anything analogous to sqlalchemy.Table (for example).

Can you describe what you're trying to do here?

OlivierBinette commented 5 months ago

Thanks for your answers!

Follow-up question:

Does con.list_tables() return the name of unmaterialized tables in some cases, or of tables registered in a different schema or database? I'm working with DuckDB in particular, and if I'm not mistaken some of the tables listed in listed_tables might be in the temp rather than main. I'm not 100% sure though as I think there has been some changes with list_tables() recently.

Right now, I use a direct connection to my database (using DuckDB's Python client) to query my database and find whether or not a table has been materialized:

def material_tables_list(self):
        material_tables_data = self.duckdb_connection.sql("show tables").fetchall()
        columns_data = list(zip(*material_tables_data))
        if len(columns_data) > 0:
            return columns_data[0]
        else:
            return []

For your final question question:

The pipeline that I use after pre-processing data with Ibis expects the name of a table in my database or a DuckDBPyRelation object. It could possibly accomodate a sqlalchemy.Table class as well. This requires making sure that the input Ibis Table has been materialized and then fetching its physical name.

For more information, here's what I do:

  1. I have data processing functions that take an Ibis Table as an input and return an Ibis Table as output.
  2. Some of these data processing functions rely on a secondary pipeline which takes a table name (and a connection to my database) as arguments. In these functions, I check whether or not the input Ibis Table has been materialized, materialize it if not, and then use the get_name() function to pass the table name to the secondary pipeline. The output of the pipeline is a new table in my database, and I return an Ibis Table reference to it.
  3. Data scientists organize these data processing functions into a pipeline, by specifying dependencies across steps. An orchestration tool then runs the pipeline, making sure that relevant intermediate outputs are both materialized to the local database and saved to a data warehouse used for persistent storage.

Happy to share code if this is too abstract. I could put together a minimal example.

OlivierBinette commented 5 months ago

@cpcloud Big picture, I think the key thing is that I am using Ibis Table objects not just for data manipulation, but also as my abstract table interface in my system.

As such, I need to not only be able to create Ibis Tables, but I also need to be able to go from an Ibis Table back to my backend database. The common interface for this might be to materialize a table in a database/schema/namespace of your choice (or even just the default one), and then get the name of the physical table as a string. I think this is mostly supported by Ibis, but the documentation might need to be clarified a bit and that use case could be highlighted.

NickCrews commented 5 months ago

I think the use @OlivierBinette describes here is nearly the exact same use case that I had in https://github.com/ibis-project/ibis/issues/6636.

I also started #7762 if anyone wants to chime in there.

I think better docs would logically get added to https://ibis-project.org/how-to/input-output/basics#data-platforms. @OlivierBinette what parts of that guide are lacking that we should add?

OlivierBinette commented 5 months ago

@NickCrews Maybe the data-platforms page could include a few points about the following:

I don't mind whether or not Ibis' Table API directly implements functions to do these things, but I definitely want to have enough information so that I can implement reliable functions that do these things for a specific backend I might be using.

cpcloud commented 5 months ago

@OlivierBinette I'm having a bit of trouble understanding what you mean by "materialize".

Does that mean creating a table in the database from an expression, or executing an expression (getting back in-memory results), or something else?

OlivierBinette commented 5 months ago

@cpcloud By "materialize", I mean creating a persistent table in the database from an expression. I.e., the Ibis Table object should point a table in the backend database. Specifically:

Maybe "persisted" would be a better word here.

Here's how I currently check if a table has been "materialized", using a separate direct connection self.duckdb_connection to my DuckDB database:

def material_tables_list(self):
        material_tables = self.duckdb_connection.sql("show tables").fetchall()
        columns_data = list(zip(*material_tables))
        if len(columns_data) > 0:
            return columns_data[0]
        else:
            return []

And here's how I create a "material view" for a given table:

def material_view(self, table: Table, table_name: Optional[str] = None, overwrite=False):
        if table_name is None:
            table_name = table.get_name() # This raises an error if the Ibis table refers to an unnamed expression. A uuid should be created in that case.

        if overwrite or table_name not in self.material_tables_list():
            return self.db.create_table(table_name, table, overwrite=True)
        else:
            return self.db.tables[table_name]
cpcloud commented 5 months ago

Hm,

"material view"

is a confusing way to refer to "disk-backed table", especially since many platforms these days have an object called a "materialized view", which is neither a table nor is it a traditional view 😅

I think we need some documentation that gives some conceptual foundations about what words like "view" and "table" mean. They're independent of ibis, but I think we should all try to get on the same page with respect to database nouns to the extent possible.

For example, both tables and views have the ability to be temporary and overwritten, but only tables have data that takes up storage space.

cpcloud commented 5 months ago

To clarify a couple things:

This table should have a fixed name/location in the database that can be referenced by other, separate connections to the database.

This is the notion of a non-temporary tabular object in a database, i.e., a thing created with either CREATE VIEW or CREATE TABLE, no TEMPORARY.

The table should be persisted to the database, i.e. not be dropped once Ibis' connection to the database is closed.

This is the same notion as the first bullet point. Unless a view or a table was created with temp=True (which maps to the TEMPORARY keyword in SQL), the object will persist beyond the Ibis session.

OlivierBinette commented 5 months ago

@cpcloud Thanks for the clarifications! I think what confuses things more is that an Ibis "table" can be a database table, a database view, an unevaluated expression, etc. So ultimately the documentation issue is to clarify the relationship between the Ibis and backend concepts, and how to figure out how one maps to the other.

For my application, I often want to make sure that an Ibis table refers to a non-temporary database table, or to make an Ibis table to a non-temporary database table if that's not already the case.

cpcloud commented 5 months ago

For my application, I often want to make sure that an Ibis table refers to a non-temporary database table

I would say that there's no need verify this if you're using create_table. Its result will always be a persistent named table taking up space on disk.

or to make an Ibis table to a non-temporary database table if that's not already the case.

I think create_table is the answer here as well:

t = con.create_table(name, any_table_expression)  # the second argument can be a view or a table

The return value of create_table called this way will always be a table that persists beyond the ibis session.

OlivierBinette commented 5 months ago

@cpcloud

I would say that there's no need verify this if you're using create_table. Its result will always be a persistent named table taking up space on disk.

I have functions that take as input Ibis Table objects, and unfortunately I don't know how the user created them. I'm using Ibis as a high-level Table interface abstraction for a piece of software, but I'm not responsible for supplying the input.

Should I do try/catch on con.create_table? This function throws an error if the table already exists, and I can overwrite otherwise. However, the table could already exist and be (for DuckDB) in the "temp" database schema instead of the "main" schema.

There's also a cache() function for tables, but I'm not sure what to expect of it regarding the backend.

Overall, I need a better understanding of how I should inspect Ibis Table objects against their backend.

cpcloud commented 5 months ago

I have functions that take as input Ibis Table objects, and I don't know how the user created them.

Ah, okay, that helps me understand a bit more about what you're doing.

Should I do try/catch on con.create_table? This function throws an error if the table already exists, and I can overwrite otherwise.

Interesting, we actually had a force argument that would generate a CREATE TABLE t IF NOT EXISTS statement which sounds like potentially what you're looking for. The semantics are: create the table as usual if it doesn't exist otherwise return the existing table.

Perhaps we should consider bringing that back, with a better name of course 😄

There's also a cache() function for tables, but I'm not sure what to expect of it regarding the backend.

cache is explicitly designed to be session-bound: any cached tables are temporary and dropped at the end of the session.

cpcloud commented 5 months ago

@OlivierBinette Jumping back in here a bit.

Can you elaborate on why you need to know whether the thing underlying the Ibis table expression exists on disk as a non-temporary table?

Generally speaking if you're providing an API to end users that accepts arbitrary Ibis table expressions your API shouldn't need to know details like whether the table exists on disk. Those expressions could be a physical table, or a 400 element UNION ALL monstrosity.

I'd like to understand the architectural details of your application so that we can get a better idea of whether we need to bring back the CREATE TABLE IF NOT EXISTS functionality or not.

OlivierBinette commented 5 months ago

@cpcloud Yes for sure. I'm building a Python package that will be used by data scientists to define data processing and entity resolution pipelines. The package has pre-built functions that take Ibis Table as input, and return Ibis Tables.

Some of these pre-built functions rely on external tools for more complex data processing, including Splink. To hand off a table to Splink, I need to give Splink the name of the table in the database. I also need to make sure that the table is non-temporary, as Splink relies on its own, separate connection to the database.

So basically, I want to have interoperability between Ibis and other systems that work with the same backend database.

OlivierBinette commented 5 months ago

@cpcloud Ideally, I'd be able to achieve the following two things:

  1. Distinguish an Ibis Table that simply points to a database view/table from an Ibis Table that represents some other expression.
  2. For an Ibis Table that points to a database view/table, to be able to get information about the view/table, such as its name, and whether or not it is a temporary table.
NickCrews commented 5 months ago

I am in a very similar boat with needing to pass off tables back and forth with Splink.

@OlivierBinette Are you sure that you actually need to be able to distinguish, or do you really only care that the result fits your needs? ie if ibis had this function and you didn't know the internals, would that work?

def ensure_named(t: Table, name: str | None = None, temp: bool = True, view: bool = True) -> Table:
    """
    Create a temp/permanent view/table in the database if the table isn't already that state.
    """

After this method, it is guaranteed that t.get_name() will give me a name that splink will accept.

OlivierBinette commented 5 months ago

@NickCrews Yes that would cover my use case. I currently have a function that roughly does this for my DuckDB backend, but it's a bit janky.

The create_table and create_view functions could also be changed to have a force argument as cpcloud was saying, although we'd want to make sure that these functions always return Ibis Tables backed by non-temporary tables.

cpcloud commented 5 months ago

create_table and create_view always create objects that persist, unless you pass temp=True to them.