Open user-jx opened 4 weeks ago
I'm surprised you don't get an error:
/tmp/ipykernel_9568/1664581211.py:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '0' has dtype incompatible with datetime64[ns], please explicitly cast to a compatible dtype first.
df.loc[0,'D'] = 0; df.loc[0,'D'] = float('nan')
I don't have the latest version of pandas but a slightly earlier one, which is probably why I don't get the error. So DataFrames now accept only one dtype per column? If that is the case, then my questions are probably answered. Thank you for your time.
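For reference, the warning quoted above asks for an explicit cast before assigning a value of a different type. A minimal sketch of that cast, assuming a hypothetical single-column frame (the dates and the column contents are made up for illustration and are not the reporter's data):

```python
import pandas as pd

# Hypothetical frame: "D" starts out as a datetime64[ns] column.
df = pd.DataFrame({"D": pd.to_datetime(["2024-01-01", "2024-01-02"])})

# Cast the column to object first, so it can hold arbitrary Python values.
df["D"] = df["D"].astype(object)

# Assigning an integer no longer conflicts with the column's dtype.
df.loc[0, "D"] = 0

print(df["D"].dtype)
```

The column can still hold mixed values after the cast; it is the implicit change of dtype on assignment that the warning is about.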
I'd ask for a second opinion; dealing with missing data in pandas is a whole different can of worms nowadays.
Thanks for the report. When using multiple columns, pandas uses the groupby internals to determine what the duplicates are. groupby identifies all NA values as a single group. This is #48476. When using a single column, pandas uses Series.duplicated, which uses a hashmap for better performance and does differentiate between NA values.
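The two code paths described here can be compared side by side. This is only a sketch on a hypothetical two-row frame (the column names "A" and "D" and the values are assumptions, not the reporter's example), and the results may depend on the installed pandas version:

```python
import pandas as pd

# Hypothetical frame: "D" is an object column holding two different
# missing-value markers, pd.NaT and float("nan").
df = pd.DataFrame(
    {
        "A": [1, 1],
        "D": pd.Series([pd.NaT, float("nan")], dtype=object),
    }
)

# Both entries are reported as missing, although they are distinct objects.
both_missing = df["D"].isna().tolist()

# Multi-column path: every column takes part in the comparison.
full_frame_flags = df.duplicated().tolist()

# Single-column path: Series.duplicated on the object column.
single_column_flags = df["D"].duplicated().tolist()

print(both_missing, full_frame_flags, single_column_flags)
```

If the explanation above holds, the multi-column call flags the second row as a duplicate, while the single-column call flags neither row.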
Marking this as needs discussion for now as we need to agree on which of the two behaviors we want for both operations.
See https://github.com/pandas-dev/pandas/issues/59891 and links therein.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Hi, I have the following pandas DataFrame:
With
I get:
But when using the parameter subset for the only column that has a difference, I get:
I have two questions:
pandas.NaT and float('nan') are considered as different values by drop_duplicates()?
Thank you!
Expected Behavior
I expected that the outcome would be the same in both cases.
Installed Versions