pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Convert type nullable int <-> nullable string #31839

Closed tritemio closed 4 years ago

tritemio commented 4 years ago

Code Sample, a copy-pastable example if possible

This code raises an error:

>>> s = pd.Series([0, pd.NA], dtype='Int8')
>>> s.astype('string')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e5922a9c01ed> in <module>
----> 1 s.astype('string')

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5696         else:
   5697             # else, only a single dtype is given
-> 5698             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5699             return self._constructor(new_data).__finalize__(self)
   5700 

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    580 
    581     def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    583 
    584     def convert(self, **kwargs):

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    605         if self.is_extension:
    606             # TODO: Should we try/except this astype?
--> 607             values = self.values.astype(dtype)
    608         else:
    609             if issubclass(dtype.type, str):

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/arrays/integer.py in astype(self, dtype, copy)
    463             kwargs = {}
    464 
--> 465         data = self.to_numpy(dtype=dtype, **kwargs)
    466         return astype_nansafe(data, dtype, copy=False)
    467 

/opt/tljh/user/envs/py37-sf/lib/python3.7/site-packages/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    130                 )
    131             # don't pass copy to astype -> always need a copy since we are mutating
--> 132             data = self._data.astype(dtype)
    133             data[self._mask] = na_value
    134         else:

TypeError: data type not understood

Problem description

Converting a nullable int to nullable string requires a double conversion:

s = pd.Series([0, 1], dtype='Int8')
s.astype(str).astype('string')
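
As a possible stopgap until a direct cast exists, the conversion can be done element by element so that missing values stay `pd.NA`. This is only a sketch: `nullable_int_to_string` is a hypothetical helper name, not part of pandas, and it relies on nothing beyond `pd.array` and `pd.isna`.

```python
import pandas as pd


def nullable_int_to_string(s):
    """Hypothetical helper: nullable integer Series -> nullable string Series."""
    # Convert each value to its text form, keeping missing entries as pd.NA.
    values = [pd.NA if pd.isna(v) else str(v) for v in s]
    return pd.Series(pd.array(values, dtype="string"), index=s.index, name=s.name)


s = pd.Series([0, pd.NA], dtype="Int8")
converted = nullable_int_to_string(s)
```

It loops in Python, so it will be slower than a native `astype` on large Series.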

Likewise, converting a nullable string to nullable int requires two steps:

s = pd.Series(['0', '1'], dtype='string')
s.astype('int8').astype('Int8')

Moreover, if the nullable string series has NAs, converting to a nullable int becomes much harder (I don't know if there is a simpler way):

s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')
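
One alternative to the chain above, again only as a sketch, is to parse each element and build the nullable integer array directly with `pd.array`. The name `string_to_nullable_int` and its `dtype` parameter are hypothetical, and the helper assumes every non-missing value is a valid integer literal.

```python
import pandas as pd


def string_to_nullable_int(s, dtype="Int8"):
    """Hypothetical helper: nullable string Series -> nullable integer Series."""
    # int() raises ValueError on anything that is not an integer literal,
    # so malformed strings are not silently turned into missing values.
    values = [pd.NA if pd.isna(v) else int(v) for v in s]
    return pd.Series(pd.array(values, dtype=dtype), index=s.index, name=s.name)


s = pd.Series(["0", pd.NA], dtype="string")
converted = string_to_nullable_int(s)
```

This avoids the detour through `float64`, so large integers are not rounded along the way.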

Expected Output

It would be nice to convert directly from nullable int to nullable string, and vice versa, in one step.

I'd like for all the following conversions to work:

s = pd.Series([0, 1], dtype='Int8')
s.astype('string')   # currently raises
s = pd.Series(['0', '1'], dtype='string')
s.astype('Int8')  # currently raises
s = pd.Series(['0', pd.NA], dtype='string')
s.astype('Int8')  # currently raises

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-1058-aws
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200119
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.45.1

TomAugspurger commented 4 years ago

Thanks. Duplicate of https://github.com/pandas-dev/pandas/issues/31204 I think.

tritemio commented 4 years ago

@TomAugspurger, sorry, I couldn't find the previous issue #31204. I'll comment there to add the round-trip argument (Int -> string -> Int).