zombie-einstein commented 3 months ago

Modin version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import numpy as np

# This will cast to a float type due to NaNs
df = pd.DataFrame({"a": [1, 2, 3, np.nan]})

# This passes
df.astype({"a": pd.Int64Dtype()})
>>    a
0     1
1     2
2     3
3  <NA>

# This fails
df.astype(pd.Int64Dtype())
>> AttributeError: 'bool' object has no attribute 'all'

Issue Description

I think this is the same error as #7276, but that issue was closed. When using astype on a Dataframe it fails at this check

File .../modin/core/dataframe/pandas/dataframe/dataframe.py:
   1736, in PandasDataframe.astype(self, col_dtypes, errors)
   1730         return df.astype(
   1731             {k: v for k, v in col_dtypes.items() if k in df}, errors=errors
   1732         )
   1734 else:
   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()
   1738         new_dtype = pandas.api.types.pandas_dtype(col_dtypes)

AttributeError: 'bool' object has no attribute 'all'

When the type argument is a single value (i.e. astype(pd.Int64Dtype())) then it seems that col_dtypes == self_dtypes works out as a single bool value (hence no all attribute).

Note that this works Ok if the argument is a dictionary of column namess to dtypes.

This also seems to be the same for Series, i.e.:

df["a"].astype(pd.Int64Dtype())

Fails with the same error

Expected Behavior

In native Pandas

df.astype(pd.Int64Dtype())

casts the DataFrame/series to the argument type

Error Logs

No response

Installed Versions

'0.31.0'

zombie-einstein commented 3 months ago

In fact this seems to work ok if in line 1736 in .../modin/core/dataframe/pandas/dataframe/dataframe.py

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

the ordering of the arguments is swappe, i.e.:

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (self_dtypes == col_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

I guess some ordering dependency in how the result is calculated?

I can open a PR to change, but not sure if there is some deeper reasoning here.

aivanoved commented 2 months ago

I have encountered this issue while working on a separate project and looking into this as the code suggests

self_dtypes is of type pandas.Series col_dtypes in the else branch is a scalar object

col_dtypes == self_dtypes uses the col_dtypes.__eq__(obj) method first, which for pandas dtypes is implemented unconditionally, so it will always return False as the left-hand side is a dtype and the left-hand side is a pandas.Series for strings it works because the direct equality is not implemented, so it tires self_dtypes.__eq__(obj)

flipping the comparison, self_dtypes == col_dtypes notices that the lhs is a series and the rhs is a scalar so it brodcasts to a pandas.Series of type bool, which has a member all to call

@zombie-einstein, to me the solution seems correct, but I am new to modin

hope one of the maintainters of modin can take a further look into this issue

More general context below

here is an example of why the flipping works:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])

int_64 = pd.Int64Dtype()
string = 'a'

print(s == string)
# 0     True
# 1    False
# 2    False
# dtype: bool

print(string == s)
# 0     True
# 1    False
# 2    False
# dtype: bool

print(int_64 == s)
# False

print(s == int_64)
# 0    False
# 1    False
# 2    False
# dtype: bool

note on usage and thoughts on it:

The documentation of astype of modin itself does say the following

def astype(self, col_dtypes, errors: str = "raise"):
    """
    Convert the columns dtypes to given dtypes.

    Parameters
    ----------
    col_dtypes : dictionary of {col: dtype,...} or str
        Where col is the column name and dtype is a NumPy dtype.
    errors : {'raise', 'ignore'}, default: 'raise'
        Control raising of exceptions on invalid data for provided dtype.

    Returns
    -------
    BaseDataFrame
        Dataframe with updated dtypes.
    """

so assert isinstance(col_dtypes, dict | str) should pass and we should not be passing dtype objects, however I would argue this is bad UX, for reference pandas accepts data type objects for astype, documentation here

modin-project / modin

BUG: `Dataframe.astype(...)` Fails With `'bool' object has no attribute 'all'` #7364