pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: FloatingArray constructor with input of all missing values: np.nan vs. pd.NA #38751

Open arw2019 opened 3 years ago

arw2019 commented 3 years ago

Not sure this is really an issue, but maybe (or not?) a slight inconsistency.

The following throws:

In [18]: import numpy as np
    ...: import pandas as pd
    ...: 
    ...: arr = pd.array([pd.NA, pd.NA], dtype="float")
    ...: ser = pd.Series(arr)
    ...: pd.to_numeric(ser, downcast="float")
    ...: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-c7fb0ba8f6eb> in <module>
      2 import pandas as pd
      3 
----> 4 arr = pd.array([pd.NA, pd.NA], dtype="float")
      5 ser = pd.Series(arr)
      6 pd.to_numeric(ser, downcast="float")

~/repos/pandas/pandas/core/construction.py in array(data, dtype, copy)
    344         return TimedeltaArray._from_sequence(data, dtype=dtype, copy=copy)
    345 
--> 346     result = PandasArray._from_sequence(data, dtype=dtype, copy=copy)
    347     return result
    348 

~/repos/pandas/pandas/core/arrays/numpy_.py in _from_sequence(cls, scalars, dtype, copy)
    178             dtype = dtype._dtype
    179 
--> 180         result = np.asarray(scalars, dtype=dtype)
    181         if copy and result is scalars:
    182             result = result.copy()

~/anaconda3/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

TypeError: float() argument must be a string or a number, not 'NAType'

but with np.nan it runs fine:

In [19]: import numpy as np
    ...: import pandas as pd
    ...: 
    ...: arr = pd.array([np.nan, np.nan], dtype="float")
    ...: ser = pd.Series(arr)
    ...: pd.to_numeric(ser, downcast="float")
    ...: 
Out[19]: 
0   NaN
1   NaN
dtype: float32

jorisvandenbossche commented 3 years ago

Note that this is not directly related to to_numeric, as it is the pd.array() construction that fails:

In [11]: pd.array([pd.NA, pd.NA], dtype="float")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-73361050ec83> in <module>
----> 1 pd.array([pd.NA, pd.NA], dtype="float")

~/scipy/pandas/pandas/core/construction.py in array(data, dtype, copy)
    344         return TimedeltaArray._from_sequence(data, dtype=dtype, copy=copy)
    345 
--> 346     result = PandasArray._from_sequence(data, dtype=dtype, copy=copy)
    347     return result
    348 

~/scipy/pandas/pandas/core/arrays/numpy_.py in _from_sequence(cls, scalars, dtype, copy)
    178             dtype = dtype._dtype
    179 
--> 180         result = np.asarray(scalars, dtype=dtype)
    181         if copy and result is scalars:
    182             result = result.copy()

~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: float() argument must be a string or a number, not 'NAType'

This is because dtype="float" actually requests a numpy-based float array, not a nullable pandas FloatingArray. It therefore tries to convert pd.NA to a float; under the hood, the error comes from:

In [12]: float(pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-7541b87f4222> in <module>
----> 1 float(pd.NA)

TypeError: float() argument must be a string or a number, not 'NAType'
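
To illustrate the point about which dtype each string selects, here is a minimal sketch (assuming pandas >= 1.2, where the nullable Float64 dtype exists). The variable names are only for illustration; pd.api.types.pandas_dtype is used here just to show how the two strings resolve, and is not necessarily the exact code path pd.array() takes internally.

```python
import numpy as np
import pandas as pd

# Lower-case "float" resolves to the plain NumPy dtype, which has no slot for pd.NA ...
numpy_dtype = pd.api.types.pandas_dtype("float")
print(numpy_dtype, isinstance(numpy_dtype, np.dtype))

# ... while capitalized "Float64" resolves to the nullable pandas extension dtype.
nullable_dtype = pd.api.types.pandas_dtype("Float64")
print(nullable_dtype, isinstance(nullable_dtype, pd.Float64Dtype))
```

Only the second dtype can represent pd.NA, which is why the capitalization of the string matters here.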

arw2019 commented 3 years ago

Updated the title to reflect the actual issue.

Are we ok with this behavior or is it something we want to "fix"?

jorisvandenbossche commented 3 years ago

I don't think it's something we plan to fix in the short term. At some point in the future, we might want lower-case names like "float" to mean the nullable dtypes instead of the plain numpy ones within a pandas context. At that point this issue would be resolved automatically, since converting an all-NA list to a nullable float array already works.
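
For anyone landing here in the meantime, a small sketch of the conversion that already works, assuming pandas >= 1.2 and that spelling the nullable dtype explicitly as "Float64" is acceptable in your code:

```python
import pandas as pd

# Request the nullable dtype explicitly; missing values are tracked in a mask,
# so pd.NA never has to be converted with float().
arr = pd.array([pd.NA, pd.NA], dtype="Float64")
ser = pd.Series(arr)

print(type(arr).__name__)
print(ser.dtype)
print(ser.isna().all())
```

This only sidesteps the construction error from the original report; it does not change what the lower-case "float" string means.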