skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

`TableVectorizer` fails on sparse DataFrames #679

Closed: LeoGrin closed this issue 1 year ago

LeoGrin commented 1 year ago

Describe the bug

TableVectorizer raises a TypeError when fitting on a DataFrame that contains columns with a sparse dtype.

Steps/Code to Reproduce

from skrub import TableVectorizer
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5]
})
# convert to sparse
df['a'] = pd.arrays.SparseArray(df['a'])
tbl_vec = TableVectorizer()
tbl_vec.fit(df)

Expected Results

No error is thrown.

Actual Results

Unexpected exception formatting exception. Falling back to standard exception
Traceback (most recent call last):
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/fy/__8z8cpn6gs04465sq9g1nq80000gn/T/ipykernel_7311/2936428833.py", line 10, in <module>
    tbl_vec.fit(df)
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py", line 694, in fit
    self.fit_transform(X, y=y)
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/leo/VSCProjects/skrub/skrub/_table_vectorizer.py", line 607, in fit_transform
  File "/Users/leo/VSCProjects/skrub/skrub/_table_vectorizer.py", line 130, in _replace_false_missing
    "NaN",
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/frame.py", line 5582, in replace
    return super().replace(
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/generic.py", line 7383, in replace
    new_data = self._mgr.replace_regex(
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 484, in replace_regex
    return self.apply("_replace_regex", **kwargs, using_cow=using_copy_on_write())
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 698, in _replace_regex
    replace_regex(new_values, rx, value, mask)
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/array_algos/replace.py", line 148, in replace_regex
    values[:] = f(values)
  File "/Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/pandas/core/arrays/sparse/array.py", line 586, in __setitem__
    raise TypeError(msg)
TypeError: SparseArray does not support item assignment via setitem
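
The last frames of the traceback point at pandas rather than skrub: `replace_regex` assigns in place with `values[:] = f(values)`, and `SparseArray.__setitem__` rejects any item assignment. A minimal pandas-only sketch of that behavior follows; the helper name `try_setitem` is hypothetical and exists only for illustration.

```python
import pandas as pd

# Same kind of array as the column built in the reproduction above.
sparse_values = pd.arrays.SparseArray([1, 2, 3, 4, 5])


def try_setitem(values):
    """Return the error message raised by in-place assignment, or None."""
    try:
        # Mirrors the `values[:] = f(values)` step shown in the traceback.
        values[:] = values
    except TypeError as exc:
        return str(exc)
    return None


error_message = try_setitem(sparse_values)
print(error_message)
```

This suggests that any code path that writes into a sparse column in place would hit the same error, independently of `TableVectorizer`.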

Versions

System:
    python: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:01:19) [Clang 14.0.6 ]
executable: /Users/leo/mambaforge/envs/skrub/bin/python
   machine: macOS-12.6.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 23.1.2
   setuptools: 67.7.2
        numpy: 1.24.3
        scipy: 1.10.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/leo/mambaforge/envs/skrub/lib/libopenblas.0.dylib
        version: 0.3.23
threading_layer: openmp
   architecture: VORTEX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/leo/mambaforge/envs/skrub/lib/libomp.dylib
        version: None
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/leo/mambaforge/envs/skrub/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 8
0.0.1.dev0

LilianBoulard commented 1 year ago

Annoying, though expected (I didn't even know pandas had sparse DataFrames lol). Thanks for reporting :)

Vincent-Maladiere commented 1 year ago

Interesting, which situations trigger a sparse pandas array? It seems fairly rare. Also, we should support CSR matrices for the GapEncoder and MinHashEncoder, but that sounds like a longer-term feature.

LeoGrin commented 1 year ago

> Interesting, which situations trigger a sparse pandas array? It seems fairly rare.

It happens quite often when fetching datasets from OpenML (see #665, where it happened 127 times).

Vincent-Maladiere commented 1 year ago

Fixed by https://github.com/skrub-data/skrub/pull/737
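
For anyone on a version without that fix, one possible workaround is to convert sparse columns to dense before fitting. The sketch below is an assumption about what works on the user side, not a description of what the linked PR does; `densify_sparse_columns` is a hypothetical helper name, and the extra column `b` is added only for illustration.

```python
import pandas as pd


def densify_sparse_columns(df):
    """Return a copy of df where every sparse column is converted to dense."""
    out = df.copy()
    for col in out.columns:
        # pd.SparseDtype is the dtype pandas assigns to SparseArray-backed columns.
        if isinstance(out[col].dtype, pd.SparseDtype):
            out[col] = out[col].sparse.to_dense()
    return out


df = pd.DataFrame(
    {
        "a": pd.arrays.SparseArray([1, 2, 3, 4, 5]),  # sparse, as in the report
        "b": list("vwxyz"),  # ordinary dense column, left untouched
    }
)
dense_df = densify_sparse_columns(df)
print(dense_df.dtypes)
```

The result can then be passed to `TableVectorizer().fit(dense_df)`. Densifying gives up the memory savings of the sparse representation, so it may not be practical for very wide OpenML datasets.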