ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.36k stars 540 forks source link

bug: `read_parquet` and similar methods silently overwrite tables #9199

Open deepyaman opened 2 weeks ago

deepyaman commented 2 weeks ago

What happened?

Here's an example derived from a user issue:

>>> import ibis
>>> t = ibis.examples.penguins.fetch()
>>> t.to_parquet("p1.parquet")
>>> t.rename("2_{name}").to_parquet("p2.parquet")
>>> con = ibis.duckdb.connect()
>>> p1 = con.read_parquet("p1.parquet", table_name="p")
>>> con.tables
Tables
------
- p
>>> p2 = con.read_parquet("p2.parquet", table_name="p")
>>> con.tables  # This won't look good...
Tables
------
- p
>>> p1.join(p2, p1["species"] == p2["2_species"]).execute()  # Ibis compiled the expression, but it's not valid DuckDB, since the table definition got overridden
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/joins.py", line 190, in wrapper
    return method(self._finish(), *args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t2" does not have a column named "species"
>>> p1.join(p2, p1["2_species"] == p2["2_species"]).execute()  # Ibis won't be happy, because it still thinks p1 is the table that isn't renamed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 775, in __getitem__
    values = self.bind(args)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 227, in bind
    values.extend(bind(self, arg))
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 98, in bind
    yield ops.Field(table, value).to_expr()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/bases.py", line 72, in __call__
    return cls.__create__(*args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/grounds.py", line 120, in __create__
    return super().__create__(**kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/operations/relations.py", line 86, in __init__
    raise IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column '2_species' is not found in table. Existing columns: 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'.
>>> p1["species"].execute()  # Even simple stuff will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t0" does not have a column named "species"

Shouldn't read_parquet, etc. just have an overwrite option? If you don't specify that, it should not let you clobber an existing table.

What version of ibis are you using?

9.0.0 (initially reported on 8.0.0)

What backend(s) are you using, if any?

DuckDB, but can be replicated on other backends

Relevant log output

No response

Code of Conduct

gforsyth commented 2 weeks ago

This does seem like a potential footgun.

I think with DuckDB, at least, read parquet wouldn't overwrite an existing table, but read_parquet is creating views and we're likely doing a create or replace.

I think an overwrite kwarg is a good idea

gforsyth commented 1 week ago

Just to spell this out concretely:

There should be an overwrite kwarg to read_parquet (and read_csv, read_json, etc...) that defaults to False.

In terms of creating tables / views, we should default to CREATE VIEW name AS ... and only throw in a CREATE OR REPLACE if overwrite=True