ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.3k stars 596 forks source link

feat: Better error when creating table with conflicting names due to capitalization #10217

Closed NickCrews closed 2 weeks ago

NickCrews commented 1 month ago

Is your feature request related to a problem?

Ibis treats columns named "a" and "A" differently, and you can have both in a relation. I think this makes, some backends support this.

However, there are some sharp edges when you try to use this with a backend that DOESN'T distinguish between "A" and "a", such as duckdb:

import ibis

t = ibis.memtable({"a": [1, 2, 3]})
t = t.mutate(A=t.a)
t.cache()
Details ```python-traceback --------------------------------------------------------------------------- CatalogException Traceback (most recent call last) Cell In[2], line 7 5 t = ibis.memtable({"a": [1, 2, 3]}) 6 t = t.mutate(A=t.a) ----> 7 t.cache() File ~/code/ibis/ibis/expr/types/relations.py:3531, in Table.cache(self) 3458 """Cache the provided expression. 3459 3460 All subsequent operations on the returned expression will be performed (...) 3528 └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴───┘ 3529 """ 3530 current_backend = self._find_backend(use_default=True) -> 3531 return current_backend._cached_table(self) File ~/code/ibis/ibis/backends/__init__.py:772, in CacheHandler._cached_table(self, table) 770 entry = self._cache_op_to_entry.get(table.op()) 771 if entry is None or (cached_op := entry.cached_op_ref()) is None: --> 772 cached_op = self._create_cached_table(util.gen_name("cached"), table).op() 773 entry = CacheEntry( 774 table.op(), 775 weakref.ref(cached_op), (...) 778 ), 779 ) 780 self._cache_op_to_entry[table.op()] = entry File ~/code/ibis/ibis/backends/__init__.py:805, in CacheHandler._create_cached_table(self, name, expr) 804 def _create_cached_table(self, name: str, expr: ir.Table) -> ir.Table: --> 805 return self.create_table(name, expr, schema=expr.schema(), temp=True) File ~/code/ibis/ibis/backends/duckdb/__init__.py:197, in Backend.create_table(self, name, obj, schema, database, temp, overwrite) 193 # This is the same table as initial_table unless overwrite == True 194 final_table = sg.table( 195 name, catalog=catalog, db=database, quoted=self.compiler.quoted 196 ) --> 197 with self._safe_raw_sql(create_stmt) as cur: 198 if query is not None: 199 insert_stmt = sge.insert(query, into=initial_table).sql(self.name) File ~/.pyenv/versions/3.11.6/lib/python3.11/contextlib.py:137, in _GeneratorContextManager.__enter__(self) 135 del self.args, self.kwds, self.func 136 try: --> 137 return next(self.gen) 138 except StopIteration: 139 raise RuntimeError("generator didn't yield") from None File ~/code/ibis/ibis/backends/duckdb/__init__.py:319, in Backend._safe_raw_sql(self, *args, **kwargs) 317 @contextlib.contextmanager 318 def _safe_raw_sql(self, *args, **kwargs): --> 319 yield self.raw_sql(*args, **kwargs) File ~/code/ibis/ibis/backends/duckdb/__init__.py:97, in Backend.raw_sql(self, query, **kwargs) 95 with contextlib.suppress(AttributeError): 96 query = query.sql(dialect=self.name) ---> 97 return self.con.execute(query, **kwargs) CatalogException: Catalog Error: Column with name A already exists! ```

related issues, though I don't think exactly the same:

IDK, this may also just be a bug with the duckdb backend, not a more general feature request.

What is the motivation behind your request?

I had a bug in my code where I accidentally created both columns "A" and "a". Then, I got this cryptic error later.

Describe the solution you'd like

Not sure exactly what to do here, but a few ideas:

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, not sure the story on other backends.

Code of Conduct

cpcloud commented 1 month ago

To do this, we'd have to encode the logic of every backend's identifier casing rules inside of Ibis.

DuckDB for some reason has decided on this--in my opinion--extremely confusing hybrid that makes it really difficult to predict when things will 1) work, 2) not fail but generate a deduplicated name with a suffix, or 3) utterly fail with an inscrutable error like that one you're showing.

While I sympathize with the ask here and I wish humanity would really just stop already with case-insensitive things in computers, I don't think the work and continued maintenance effort is justified.

NickCrews commented 1 month ago

Ok, if it's a big burden then I think it's fine to just punt, it's probably not THAT common. Maybe I can fix up this issue title to be SEO so others can find it.

Would it make sense to make a feature request for this in sqlglot? I haven't really looked at their architecture, but maybe since they already have so many per-backend rules, eg quoting, then this wouldn't be as hard for them as it would be for us?

cpcloud commented 2 weeks ago

I don't think it would make sense in sqlglot, as they generally aren't trying to unify database semantics where it's not easily possible.