scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Behavior change when writing after setitem operations with pandas 2.0 vs pandas 1.5.3 #40

Closed: DriesSchaumont closed this issue 1 year ago

DriesSchaumont commented 1 year ago

Describe the bug

With pandas 2.0.0, the concat behavior has changed when concatenating a boolean and a numeric dtype. The resulting dtype used to be numeric, which mudata can write. It is now object, and writing fails with TypeError: Can't implicitly convert non-string objects to strings. The behavior of bool + nan also differs from that of str + nan, which causes no problems.

Warning in pandas 1.5.3:

FutureWarning: Behavior when concatenating bool-dtype and numeric-dtype arrays is deprecated; in a future version these will cast to object dtype (instead of coercing bools to numeric values). To retain the old behavior, explicitly cast bool-dtype arrays to numeric dtype.
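
The dtype change described by this warning can be seen with plain pandas, without mudata. The following is a minimal sketch; the series names and index labels are made up for illustration, and the dtype printed for the implicit concat depends on the installed pandas version (numeric on 1.5.x, object on 2.x, per the warning above).

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for a bool column in one modality and the
# all-NaN filler that appears for the other modality.
bools = pd.Series([True, True], index=["q", "w"])
nans = pd.Series([np.nan, np.nan], index=["x", "y"])

# Implicit concat: the resulting dtype depends on the pandas version.
implicit = pd.concat([bools, nans])
print(pd.__version__, implicit.dtype)

# Explicit cast, as the warning suggests: float64 on either version.
explicit = pd.concat([bools.astype("float64"), nans])
print(explicit.dtype)
```

Casting to a numeric dtype keeps the old result, at the cost of storing True as 1.0.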

To Reproduce

import pandas as pd
import mudata
import anndata
import numpy as np
from itertools import product
import warnings

dtype_matrix = {"na": np.nan, "string": "str", "bool": True, "float": 1.0}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for first_col, second_col in product(dtype_matrix.items(), repeat=2):
        first_col_type, first_col_val = first_col
        second_col_type, second_col_val = second_col
        m = mudata.MuData({
            "mod1": anndata.AnnData(pd.DataFrame([[1,2], [3,4]]), obs=pd.DataFrame(index=list("AB")), var=pd.DataFrame([["a", "b"], ["c", "d"]], index=["q", "w"], columns=["var1", "overlap"]), dtype=np.float64),
            "mod2": anndata.AnnData(pd.DataFrame([[5,6], [7,8]]), obs=pd.DataFrame(index=list("CD")), var=pd.DataFrame([["e", "f"], ["g", "h"]], index=["x", "y"], columns=["var2", "overlap"]), dtype=np.float64),
        })
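        # Assign a scalar of each type to the same column name in both modalities
        m.mod['mod1'].var['test'] = first_col_val
        m.mod['mod2'].var['test'] = second_col_val
        # update() concatenates the per-modality .var columns into m.var
        m.update()
        could_write = True
        try:
            m.write("test.h5mu")
        except TypeError:
            could_write = False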

        print(f"Concat {first_col_type} ({first_col_val}, {m.mod['mod1'].var['test'].dtype}) and {second_col_type} ({second_col_val}, {m.mod['mod2'].var['test'].dtype}) results in: {m.var['test'].dtype}, able to write: {could_write}")

print(f"Pandas: {pd.__version__}")
print(f"anndata: {anndata.__version__}")
print(f"mudata: {mudata.__version__}")

With pandas 2.0.0:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: object, able to write: False <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: object, able to write: False <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 2.0.0
anndata: 0.8.0
mudata: 0.2.2

With pandas 1.5.3:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: float64, able to write: True <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: float64, able to write: True <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 1.5.3
anndata: 0.8.0
mudata: 0.2.2

I think this can be tracked down to this concat: https://github.com/scverse/mudata/blob/da2de81261db76368da0a712cf819df3abb53fb7/mudata/_core/mudata.py#L543-L548

Expected behaviour

I would expect the behavior to be the same with pandas 2.0.0 as with pandas 1.5.3.

System

Additional context

This could be related to https://github.com/scverse/anndata/issues/679, but the issue reported here is a behavior change, so I would flag it as a separate bug. Either way, the discrepancy between str + nan and bool + nan should be resolved.

gtca commented 1 year ago

Hey @DriesSchaumont,

Thanks for noticing this change of behaviour with pandas 2.0 and providing a great example to test it.

I've started addressing it in https://github.com/scverse/mudata/pull/43, beginning with the boolean + nan combination that you highlighted. So far I'm taking advantage of nullable boolean arrays.

In case you have any thoughts on what behaviour you would find most intuitive and/or how we can potentially generalise this decision making beyond just bool -> boolean conversion for nullable boolean arrays, I'd be interested to discuss it!
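
For readers unfamiliar with nullable boolean arrays, here is a minimal pandas-only sketch of the bool -> boolean idea. It is not taken from the linked pull request; the series names and index labels are illustrative, and whether the resulting column can be written to .h5mu depends on the anndata and mudata versions, which this sketch does not check.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins: a bool column and the NaN filler for the other modality.
bools = pd.Series([True, True], index=["q", "w"])
nans = pd.Series([np.nan, np.nan], index=["x", "y"])

# Cast both sides to the nullable "boolean" extension dtype before concatenating.
# Missing entries become <NA> instead of forcing the column to object or float.
nullable = pd.concat([bools.astype("boolean"), nans.astype("boolean")])
print(nullable.dtype)
print(nullable.isna().tolist())
```

Unlike a cast to float64, this keeps the values as True/False and marks the missing entries explicitly.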

gtca commented 1 year ago

By the way, already with pandas 1.5.2 and mudata 0.2.3, float + bool is coerced to object dtype (same as bool + float).

A short update: mudata 0.3.0 will try to be more careful about using nullable boolean arrays, to avoid potential issues like https://github.com/scverse/muon/issues/111 (e.g. by using bool when the column ends up with no NA values).
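
The "use bool when there is no NA" idea could look roughly like the sketch below. The helper name to_plain_bool_if_no_na is hypothetical and is not part of mudata; this only illustrates the decision with plain pandas and is not the 0.3.0 implementation.

```python
import pandas as pd


def to_plain_bool_if_no_na(column: pd.Series) -> pd.Series:
    """Hypothetical helper: fall back to NumPy bool when a nullable boolean column has no NA."""
    if isinstance(column.dtype, pd.BooleanDtype) and not column.isna().any():
        return column.astype(bool)
    # Columns with NA (or with any other dtype) are returned unchanged.
    return column


complete = to_plain_bool_if_no_na(pd.Series([True, False], dtype="boolean"))
with_na = to_plain_bool_if_no_na(pd.Series([True, pd.NA], dtype="boolean"))
print(complete.dtype, with_na.dtype)
```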