ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.53k stars 552 forks source link

bug: Failed to load Postgres table containing ENUM #9295

Open hazirliver opened 1 month ago

hazirliver commented 1 month ago

What happened?

I encountered an issue when attempting to load a PostgreSQL table dataset_source that contains a custom ENUM data type dataset_source_enum. The error occurs both when using con.table('dataset_source') and con.sql("SELECT * FROM dataset_source;"). Steps to reproduce:

  1. Create a custom ENUM type in PostgreSQL:

    CREATE TYPE "dataset_source_enum" AS ENUM (
    'GEO',
    'SRA',
    );
  2. Create a table using the ENUM type:

    CREATE TABLE "dataset_source"
    (
    "dataset_source_id" uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    "dataset_source_name" dataset_source_enum UNIQUE,
    "dataset_source_desc" text
    );
  3. Insert some data into the table:

    INSERT INTO "dataset_source" ("dataset_source_name", "dataset_source_desc")
    VALUES 
    ('GEO', 'Gene Expression Omnibus'),
    ('SRA', 'Sequence Read Archive'),
  4. Attempt to load the table in ibis:

    
    import ibis
    ibis.options.interactive = True

db_config = { 'user': '', 'password': "", 'host': '', 'port': '', 'database': '***' }

con = ibis.postgres.connect(*db_config) con.table('dataset_source') # gives same error: con.sql("SELECT FROM dataset_source;")


Error message is provided in `Relevant log output` section

### What version of ibis are you using?

9.0.0

### What backend(s) are you using, if any?

PostgeSQL

### Relevant log output

```sh
---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/parser.py:1205, in Parser.parse_into(self, expression_types, raw_tokens, sql)
   1204 try:
-> 1205     return self._parse(parser, raw_tokens, sql)
   1206 except ParseError as e:

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/parser.py:1244, in Parser._parse(self, parse_method, raw_tokens, sql)
   1243 if self._index < len(self._tokens):
-> 1244     self.raise_error("Invalid expression / Unexpected token")
   1246 self.check_errors()

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/parser.py:1285, in Parser.raise_error(self, message, token)
   1284 if self.error_level == ErrorLevel.IMMEDIATE:
-> 1285     raise error
   1287 self.errors.append(error)

ParseError: Invalid expression / Unexpected token. Line 1, Col: 19.
  dataset_source_enum

The above exception was the direct cause of the following exception:

ParseError                                Traceback (most recent call last)
Cell In[83], line 1
----> 1 con.table('dataset_source')

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/ibis/backends/sql/__init__.py:137, in SQLBackend.table(self, name, schema, database)
    134     catalog = table_loc.catalog or None
    135     database = table_loc.db or None
--> 137 table_schema = self.get_schema(name, catalog=catalog, database=database)
    138 return ops.DatabaseTable(
    139     name,
    140     schema=table_schema,
    141     source=self,
    142     namespace=ops.Namespace(catalog=catalog, database=database),
    143 ).to_expr()

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/ibis/backends/postgres/__init__.py:589, in Backend.get_schema(self, name, catalog, database)
    584 if not rows:
    585     raise com.IbisError(f"Table not found: {name!r}")
    587 return sch.Schema(
    588     {
--> 589         col: type_mapper.from_string(typestr, nullable=nullable)
    590         for col, typestr, nullable in rows
    591     }
    592 )

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/ibis/backends/sql/datatypes.py:426, in PostgresType.from_string(cls, text, nullable)
    423 if dtype := cls.unknown_type_strings.get(text.lower()):
    424     return dtype
--> 426 sgtype = sg.parse_one(text, into=sge.DataType, read=cls.dialect)
    427 return cls.to_ibis(sgtype, nullable=nullable)

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/__init__.py:137, in parse_one(sql, read, dialect, into, **opts)
    134 dialect = Dialect.get_or_raise(read or dialect)
    136 if into:
--> 137     result = dialect.parse_into(into, sql, **opts)
    138 else:
    139     result = dialect.parse(sql, **opts)

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/dialects/dialect.py:511, in Dialect.parse_into(self, expression_type, sql, **opts)
    508 def parse_into(
    509     self, expression_type: exp.IntoType, sql: str, **opts
    510 ) -> t.List[t.Optional[exp.Expression]]:
--> 511     return self.parser(**opts).parse_into(expression_type, self.tokenize(sql), sql)

File ~/Documents/BG/deconv_deploy/airflow_venv/lib/python3.12/site-packages/sqlglot/parser.py:1210, in Parser.parse_into(self, expression_types, raw_tokens, sql)
   1207         e.errors[0]["into_expression"] = expression_type
   1208         errors.append(e)
-> 1210 raise ParseError(
   1211     f"Failed to parse '{sql or raw_tokens}' into {expression_types}",
   1212     errors=merge_errors(errors),
   1213 ) from errors[-1]

ParseError: Failed to parse 'dataset_source_enum' into <class 'sqlglot.expressions.DataType'>

Code of Conduct

cpcloud commented 3 weeks ago

@hazirliver Thanks for the report, looks like a bug!

gforsyth commented 3 weeks ago

How do we want to approach this? All we have at parse time is (I think) the string of the datatype name, which in this case is "dataset_source_enum"

Do we want to have a naming convention around custom types? Or should we just fallback to treating things as strings if we (and sqlglot) can't recognize it?

cpcloud commented 3 weeks ago

Naturally, this is possible with a bit of elbow grease, and it's a bit annoying:

  1. Get the known types from the database (run psql -E and then run \dT+ to see the SQL used to list custom types)
  2. There are effectively three kinds of custom types: fixed-size (enums are one of these), variable length (e.g., text or binary), and tuple (named struct types essentially), so then use that to determine what the custom type corresponds to
  3. Map that into Ibis's sqlglot type converter, perhaps with a new context argument that is perhaps a mapping from custom typename -> underlying type that sqlglot knows about.