modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.87k stars 651 forks source link

BUG: `Dataframe.astype(...)` Fails With `'bool' object has no attribute 'all'` #7364

Open zombie-einstein opened 3 months ago

zombie-einstein commented 3 months ago

Modin version checks

Reproducible Example

import modin.pandas as pd
import numpy as np

# This will cast to a float type due to NaNs
df = pd.DataFrame({"a": [1, 2, 3, np.nan]})

# This passes
df.astype({"a": pd.Int64Dtype()})
>>    a
0     1
1     2
2     3
3  <NA>

# This fails
df.astype(pd.Int64Dtype())
>> AttributeError: 'bool' object has no attribute 'all'

Issue Description

I think this is the same error as #7276, but that issue was closed. When using astype on a Dataframe it fails at this check

File .../modin/core/dataframe/pandas/dataframe/dataframe.py:
   1736, in PandasDataframe.astype(self, col_dtypes, errors)
   1730         return df.astype(
   1731             {k: v for k, v in col_dtypes.items() if k in df}, errors=errors
   1732         )
   1734 else:
   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()
   1738         new_dtype = pandas.api.types.pandas_dtype(col_dtypes)

AttributeError: 'bool' object has no attribute 'all'

When the type argument is a single value (i.e. astype(pd.Int64Dtype())) then it seems that col_dtypes == self_dtypes works out as a single bool value (hence no all attribute).

Note that this works Ok if the argument is a dictionary of column namess to dtypes.

This also seems to be the same for Series, i.e.:

df["a"].astype(pd.Int64Dtype())

Fails with the same error

Expected Behavior

In native Pandas

df.astype(pd.Int64Dtype())

casts the DataFrame/series to the argument type

Error Logs

No response

Installed Versions

'0.31.0'

zombie-einstein commented 3 months ago

In fact this seems to work ok if in line 1736 in .../modin/core/dataframe/pandas/dataframe/dataframe.py

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

the ordering of the arguments is swappe, i.e.:

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (self_dtypes == col_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

I guess some ordering dependency in how the result is calculated?

I can open a PR to change, but not sure if there is some deeper reasoning here.

aivanoved commented 2 months ago

I have encountered this issue while working on a separate project and looking into this as the code suggests

self_dtypes is of type pandas.Series col_dtypes in the else branch is a scalar object

col_dtypes == self_dtypes uses the col_dtypes.__eq__(obj) method first, which for pandas dtypes is implemented unconditionally, so it will always return False as the left-hand side is a dtype and the left-hand side is a pandas.Series for strings it works because the direct equality is not implemented, so it tires self_dtypes.__eq__(obj)

flipping the comparison, self_dtypes == col_dtypes notices that the lhs is a series and the rhs is a scalar so it brodcasts to a pandas.Series of type bool, which has a member all to call

@zombie-einstein, to me the solution seems correct, but I am new to modin

hope one of the maintainters of modin can take a further look into this issue

More general context below

here is an example of why the flipping works:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])

int_64 = pd.Int64Dtype()
string = 'a'

print(s == string)
# 0     True
# 1    False
# 2    False
# dtype: bool

print(string == s)
# 0     True
# 1    False
# 2    False
# dtype: bool

print(int_64 == s)
# False

print(s == int_64)
# 0    False
# 1    False
# 2    False
# dtype: bool

note on usage and thoughts on it:

The documentation of astype of modin itself does say the following

def astype(self, col_dtypes, errors: str = "raise"):
    """
    Convert the columns dtypes to given dtypes.

    Parameters
    ----------
    col_dtypes : dictionary of {col: dtype,...} or str
        Where col is the column name and dtype is a NumPy dtype.
    errors : {'raise', 'ignore'}, default: 'raise'
        Control raising of exceptions on invalid data for provided dtype.

    Returns
    -------
    BaseDataFrame
        Dataframe with updated dtypes.
    """

so assert isinstance(col_dtypes, dict | str) should pass and we should not be passing dtype objects, however I would argue this is bad UX, for reference pandas accepts data type objects for astype, documentation here