ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug: arrow type error when show data with 'UUID' object #8532

Open jitingxu1 opened 5 months ago

jitingxu1 commented 5 months ago

What happened?

Issue 1: DuckDB generates a different UUID for each row, but SQLite generates the same UUID for every row. Other backends may have the same issue.

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t.mutate(uuid=ibis.uuid()).to_pandas()

Issue 2: An ArrowTypeError is raised when displaying the data:

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t1 = t.mutate(my_uuid=uuid())
t1[t1.my_uuid].head()

Got the following error:

Out[7]: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/voltrondata/repos/ibis/ibis/expr/types/relations.py:516 in __interactive_rich_console__   │
│                                                                                                  │
│    513 │   │   │   width = options.max_width                                                     │
│    514 │   │                                                                                     │
│    515 │   │   try:                                                                              │
│ ❱  516 │   │   │   table = to_rich_table(self, width)                                            │
│    517 │   │   except Exception as e:                                                            │
│    518 │   │   │   # In IPython exceptions inside of _repr_mimebundle_ are swallowed to          │
│    519 │   │   │   # allow calling several display functions and choosing to display             │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/pretty.py:265 in to_rich_table                     │
│                                                                                                  │
│   262 │                                                                                          │
│   263 │   # Compute the data and return a pandas dataframe                                       │
│   264 │   nrows = ibis.options.repr.interactive.max_rows                                         │
│ ❱ 265 │   result = table.limit(nrows + 1).to_pyarrow()                                           │
│   266 │                                                                                          │
│   267 │   # Now format the columns in order, stopping if the console width would                 │
│   268 │   # be exceeded.                                                                         │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/core.py:425 in to_pyarrow                          │
│                                                                                                  │
│   422 │   │   Table                                                                              │
│   423 │   │   │   A pyarrow table holding the results of the executed expression.                │
│   424 │   │   """                                                                                │
│ ❱ 425 │   │   return self._find_backend(use_default=True).to_pyarrow(                            │
│   426 │   │   │   self, params=params, limit=limit, **kwargs                                     │
│   427 │   │   )                                                                                  │
│   428                                                                                            │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/__init__.py:218 in to_pyarrow                        │
│                                                                                                  │
│    215 │   │   table_expr = expr.as_table()                                                      │
│    216 │   │   schema = table_expr.schema()                                                      │
│    217 │   │   arrow_schema = schema.to_pyarrow()                                                │
│ ❱  218 │   │   with self.to_pyarrow_batches(                                                     │
│    219 │   │   │   table_expr, params=params, limit=limit, **kwargs                              │
│    220 │   │   ) as reader:                                                                      │
│    221 │   │   │   table = pa.Table.from_batches(reader, schema=arrow_schema)                    │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/sqlite/__init__.py:264 in to_pyarrow_batches         │
│                                                                                                  │
│   261 │   │   │   self.compile(expr, limit=limit, params=params)                                 │
│   262 │   │   ) as cursor:                                                                       │
│   263 │   │   │   df = self._fetch_from_cursor(cursor, schema)                                   │
│ ❱ 264 │   │   table = pa.Table.from_pandas(                                                      │
│   265 │   │   │   df, schema=schema.to_pyarrow(), preserve_index=False                           │
│   266 │   │   )                                                                                  │
│   267 │   │   return table.to_reader(max_chunksize=chunk_size)                                   │
│                                                                                                  │
│ in pyarrow.lib.Table.from_pandas:3874                                                            │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in dataframe_to_arrays                                                                   │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in <listcomp>                                                                            │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:598 in convert_column                                                                        │
│                                                                                                  │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│    596 │   │   │   e.args += ("Conversion failed for column {!s} with type {!s}"                 │
│    597 │   │   │   │   │      .format(col.name, col.dtype),)                                     │
│ ❱  598 │   │   │   raise e                                                                       │
│    599 │   │   if not field_nullable and result.null_count > 0:                                  │
│    600 │   │   │   raise ValueError("Field {} was non-nullable but pandas column "               │
│    601 │   │   │   │   │   │   │    "had {} null values".format(str(field),                      │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:592 in convert_column                                                                        │
│                                                                                                  │
│    589 │   │   │   type_ = field.type                                                            │
│    590 │   │                                                                                     │
│    591 │   │   try:                                                                              │
│ ❱  592 │   │   │   result = pa.array(col, type=type_, from_pandas=True, safe=safe)               │
│    593 │   │   except (pa.ArrowInvalid,                                                          │
│    594 │   │   │   │   pa.ArrowNotImplementedError,                                              │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│                                                                                                  │
│ in pyarrow.lib.array:340                                                                         │
│                                                                                                  │
│ in pyarrow.lib._ndarray_to_array:86                                                              │
│                                                                                                  │
│ in pyarrow.lib.check_status:91                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowTypeError: ("Expected bytes, got a 'UUID' object", 'Conversion failed for column my_uuid with type object')

The same expression works with to_pandas():

In [8]: t1[t1.my_uuid].to_pandas()
Out[8]:
                                  my_uuid
0    3f661a76-2d0e-4622-862e-1c4adcfd4813
1    3f661a76-2d0e-4622-862e-1c4adcfd4813
2    3f661a76-2d0e-4622-862e-1c4adcfd4813
3    3f661a76-2d0e-4622-862e-1c4adcfd4813
4    3f661a76-2d0e-4622-862e-1c4adcfd4813
..                                    ...
339  3f661a76-2d0e-4622-862e-1c4adcfd4813
340  3f661a76-2d0e-4622-862e-1c4adcfd4813
341  3f661a76-2d0e-4622-862e-1c4adcfd4813
342  3f661a76-2d0e-4622-862e-1c4adcfd4813
343  3f661a76-2d0e-4622-862e-1c4adcfd4813

What version of ibis are you using?

8.0.0

What backend(s) are you using, if any?

duckdb, sqlite

Relevant log output

No response


jcrist commented 5 months ago

Thanks for opening this. Issue 1 should be fixed in #8535.

Issue 2 is due to the to_pyarrow conversion path in sqlite (and a few other backends) going dbapi row -> pandas -> pyarrow. When returning pandas dataframes from to_pandas we currently map a UUID column to an object dtype series of uuid.UUID objects (and these objects fail when converting to pyarrow). In contrast, for to_pyarrow we return a string column with the same data.

The easiest (and I think most consistent) fix would be to stop returning uuid columns in to_pandas as uuid.UUID values and instead treat them as strings. This matches what we do for both polars and pyarrow outputs. It's also more efficient for the user since they don't end up with an object dtype series of uuid.UUID objects in the output.

cc @cpcloud for a :+1: / :-1: before I implement this fix.

kszucs commented 5 months ago

Eventually we should simplify the pandas output path down to .to_pyarrow().to_pandas(), offloading all the conversion duties to arrow. So it is a +1 from me.

cpcloud commented 5 months ago

Seems fine. I don't like that we have to do this but the alternative of implementing a custom pyarrow type seems less desirable than converting to strings.

cpcloud commented 1 week ago

The repeated UUID issue has been addressed:

In [4]: import ibis
   ...: ibis.options.interactive = True
   ...: from ibis.expr.api import row_number, uuid, now, pi
   ...:
   ...: ibis.set_backend("sqlite")
   ...: t = ibis.examples.penguins.fetch()
   ...: t.mutate(uuid=ibis.uuid()).to_pandas()
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year                                  uuid
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007  f3102c7e-167c-4854-af20-d3729580e2cc
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007  86afae37-f0e5-48d7-ba13-aa701374d4cd
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007  665cbe36-d7b7-4a7e-bd5f-ebf0c043c72f
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007  a740b304-0a13-4f89-bdb4-fa9475f2daa4
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007  a8263a30-9cbb-4175-94e9-1429a6fdb0fa
..         ...        ...             ...            ...                ...          ...     ...   ...                                   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009  112e7fcc-bf14-4177-96ee-526d9343c368
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009  c5964727-4dc0-42dd-8039-1527bd37b673
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009  a3d1a137-5847-4309-90c4-59d0f8fe35f9
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009  33da8b0b-d368-442c-ba05-44daa037b1e0
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009  de5b1031-de4a-4c13-85a6-920e4741f922