pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid ``fill_value`` for certain subtypes #59063

Closed christopher-titchen closed 5 days ago

christopher-titchen commented 1 week ago

Reproducible Example

import pandas as pd
from scipy.sparse import eye

pd.DataFrame.sparse.from_spmatrix(eye(2, dtype=bool))

Issue Description

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/pandas/pandas/core/arrays/sparse/accessor.py", line 316, in from_spmatrix
    dtype = SparseDtype(array_data.dtype, 0)
  File "/pandas/pandas/core/dtypes/dtypes.py", line 1751, in __init__
    self._check_fill_value()
  File "/pandas/pandas/core/dtypes/dtypes.py", line 1835, in _check_fill_value
    raise ValueError(
ValueError: fill_value must be a valid value for the SparseDtype.subtype

Expected Behavior

from_spmatrix should use the default argument for fill_value instead of passing 0. This fixes the error, because the default missing value selected for bool is False. The bug also affects other subtypes, such as float and complex, without raising a ValueError: for float, a fill_value of 0. or np.nan is more appropriate than the integer 0, and for complex, 0. + 0.j, np.nan + 0.j, or np.nan.
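To make the point about defaults concrete, here is a minimal sketch that asks SparseDtype for the fill_value it selects when none is passed. It only uses the public pd.SparseDtype constructor; the `defaults` dictionary is a name introduced for illustration, and the exact values printed may depend on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Build a SparseDtype for each subtype WITHOUT passing fill_value,
# so that pandas selects its own default for that subtype.
defaults = {
    subtype.__name__: pd.SparseDtype(subtype).fill_value
    for subtype in (bool, int, float, complex)
}

# Print the selected default next to the subtype name.
for name, fill_value in defaults.items():
    print(f"{name:>7}: {fill_value!r}")
```

If the bool entry comes out as False, as described above, then dropping the hard-coded 0 in from_spmatrix would be enough for the reproducible example to construct a valid dtype.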

We could also add a fill_value parameter to the DataFrame.sparse.from_spmatrix method, with a default argument of None. This would fix the issue while giving the user the flexibility to choose a fill_value.

https://github.com/pandas-dev/pandas/blob/c46fb76afaf98153b9eef97fc9bbe9077229e7cd/pandas/core/arrays/sparse/accessor.py#L316

https://github.com/pandas-dev/pandas/blob/c46fb76afaf98153b9eef97fc9bbe9077229e7cd/pandas/core/dtypes/missing.py#L638-L641

Installed Versions

INSTALLED VERSIONS
------------------
commit : c46fb76afaf98153b9eef97fc9bbe9077229e7cd
python : 3.10.14.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 3.0.0.dev0+1125.gc46fb76afa
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 70.1.0
pip : 24.0
Cython : 3.0.10
pytest : 8.2.2
hypothesis : 6.103.2
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.2.2
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.0
gcsfs : 2024.6.0
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 16.1.0
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.6.0
scipy : 1.13.1
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None