Open zombie-einstein opened 3 months ago
In fact this seems to work ok if in line 1736 in .../modin/core/dataframe/pandas/dataframe/dataframe.py
1735 # Assume that the dtype is a scalar.
-> 1736 if not (col_dtypes == self_dtypes).all():
1737 new_dtypes = self_dtypes.copy()
the ordering of the arguments is swapped, i.e.:
1735 # Assume that the dtype is a scalar.
-> 1736 if not (self_dtypes == col_dtypes).all():
1737 new_dtypes = self_dtypes.copy()
I guess some ordering dependency in how the result is calculated?
I can open a PR to change, but not sure if there is some deeper reasoning here.
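To make the ordering dependency concrete, here is a minimal sketch in plain pandas (no modin involved) that evaluates both orderings against the dtypes of a small DataFrame. The data, the variable names, and the `dtypes_all_equal` helper are illustrative assumptions and not part of modin; the helper is only one possible way to write a check that does not depend on operand order.

```python
import pandas as pd

# Illustrative data only; the column names and values are made up.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
self_dtypes = df.dtypes        # pandas.Series holding one dtype per column
col_dtypes = pd.Int64Dtype()   # a single (scalar) extension dtype

# Scalar on the left: the dtype's own __eq__ answers with a plain bool.
scalar_first = col_dtypes == self_dtypes

# Series on the left: pandas compares element-wise and returns a bool Series.
series_first = self_dtypes == col_dtypes


def dtypes_all_equal(dtypes: pd.Series, target) -> bool:
    """Hypothetical helper: True if every entry of `dtypes` equals `target`."""
    # Comparing element by element avoids relying on which operand comes first.
    return all(dtype == target for dtype in dtypes)


print(type(scalar_first).__name__)   # a plain bool has no .all() method
print(type(series_first).__name__)   # a Series does
print(dtypes_all_equal(self_dtypes, col_dtypes))
```

Whether swapping the operands or using an explicit element-wise check is the better fix is a question for the maintainers; the sketch only shows that the two orderings produce different result types.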
I ran into this issue while working on a separate project and looked into it. As the code suggests, `self_dtypes` is a `pandas.Series`, while `col_dtypes` in the `else` branch is a scalar object.
`col_dtypes == self_dtypes` calls `col_dtypes.__eq__(obj)` first. For pandas dtypes this method is implemented unconditionally, so it always returns `False` when the left-hand side is a dtype and the right-hand side is a `pandas.Series`.
For strings it works because `str` does not implement the direct comparison against a Series, so Python tries `self_dtypes.__eq__(obj)` instead.
Flipping the comparison to `self_dtypes == col_dtypes` puts the Series on the left-hand side and the scalar on the right-hand side, so the scalar is broadcast and the result is a `pandas.Series` of type `bool`, which has an `all` method to call.
@zombie-einstein, to me the solution seems correct, but I am new to modin.
I hope one of the maintainers of modin can take a further look into this issue.
import pandas as pd
s = pd.Series(['a', 'b', 'c'])
int_64 = pd.Int64Dtype()
string = 'a'
print(s == string)
# 0 True
# 1 False
# 2 False
# dtype: bool
print(string == s)
# 0 True
# 1 False
# 2 False
# dtype: bool
print(int_64 == s)
# False
print(s == int_64)
# 0 False
# 1 False
# 2 False
# dtype: bool
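The results above follow from Python's reflected-comparison protocol: the left operand's `__eq__` runs first, and the right operand only gets a chance if the left one returns `NotImplemented`. The sketch below mimics this with two hypothetical stand-in classes, `EagerDtype` and `Column`; they are assumptions for illustration and do not reproduce the actual pandas implementation.

```python
class EagerDtype:
    """Hypothetical stand-in for a dtype whose __eq__ never defers."""

    def __eq__(self, other):
        # Always answers with a bool, even for unrelated operand types.
        return isinstance(other, EagerDtype)


class Column:
    """Hypothetical stand-in for a Series that broadcasts comparisons."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Compare every stored element against the scalar operand.
        return [value == other for value in self.values]


col = Column(["a", "b", "c"])

# str.__eq__ returns NotImplemented for a Column, so Column.__eq__ runs instead.
str_first = "a" == col

# EagerDtype.__eq__ answers immediately, so Column.__eq__ is never consulted.
dtype_first = EagerDtype() == col

# With the Column on the left, the comparison is broadcast element-wise.
col_first = col == EagerDtype()

print(str_first)
print(dtype_first)
print(col_first)
```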
The documentation of `astype` in modin itself says the following:
def astype(self, col_dtypes, errors: str = "raise"):
    """
    Convert the columns dtypes to given dtypes.

    Parameters
    ----------
    col_dtypes : dictionary of {col: dtype,...} or str
        Where col is the column name and dtype is a NumPy dtype.
    errors : {'raise', 'ignore'}, default: 'raise'
        Control raising of exceptions on invalid data for provided dtype.

    Returns
    -------
    BaseDataFrame
        Dataframe with updated dtypes.
    """
So `assert isinstance(col_dtypes, dict | str)` should pass, and we should not be passing dtype objects. However, I would argue this is bad UX; for reference, pandas accepts data type objects for `astype` (documentation here).
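For comparison, this is a small sketch of what native pandas accepts. It uses plain pandas only and says nothing about modin's behaviour; the DataFrame contents are made up for illustration.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A dtype object applied to every column.
cast_scalar = df.astype(pd.Int64Dtype())

# The equivalent string alias.
cast_string = df.astype("Int64")

# A mapping of column name to dtype.
cast_mapping = df.astype({"a": pd.Int64Dtype(), "b": pd.Int64Dtype()})

print(cast_scalar.dtypes)
```

All three calls are expected to produce the same result, which is the behaviour a user coming from pandas would likely assume for a drop-in replacement.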
Modin version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
I think this is the same error as #7276, but that issue was closed. When using `astype` on a DataFrame, it fails at this check. When the type argument is a single value (i.e. `astype(pd.Int64Dtype())`), it seems that `col_dtypes == self_dtypes` works out as a single bool value (hence no `all` attribute). Note that this works OK if the argument is a dictionary of column names to dtypes.
This also seems to be the same for Series, i.e.:
Fails with the same error
Expected Behavior
Native pandas casts the DataFrame/Series to the argument type.
Error Logs
No response
Installed Versions
'0.31.0'