pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Series of object/strings cannot be converted to Int64Dtype() #28599

Closed mar-ses closed 4 years ago

mar-ses commented 4 years ago

Edited to add information.

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.Series(['123', '345', '456'])
a.astype(int)       # works
a.astype('Int64')   # doesn't work

Problem description

Currently, converting an object-dtype Series containing strings to Int64 doesn't work, even though it should. It produces a long error (see the traceback at the end).

Important to note: the above is trying to convert to Int64 with a capital I. That is the new nullable-integer dtype that was added to pandas. pandas supports it, yet I think something inside astype wasn't updated to reflect that.

In essence, the example above should work; there is no reason for it to fail, and it's quite simply a bug (in answer to some comments). Moreover, to_numeric is not a sufficient replacement here: when there are missing values it doesn't convert to Int64, it converts to float automatically, as sketched below (this is actually a non-trivial problem when dealing with long integer identifiers, such as GAIA target identifiers).
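For reference, a minimal sketch of the to_numeric fallback described above (the values are made up):

```python
import pandas as pd

s = pd.Series(['123', None, '456'])

# With a missing value present, to_numeric silently falls back to float64
# instead of producing a nullable Int64 result:
print(pd.to_numeric(s).dtype)  # float64
```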

Traceback:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      2 a = pd.Series(['123', '345', '456'])
      3 print(a.astype(int))
----> 4 print(a.astype('Int64'))

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880         # else, only a single dtype is given
   5881         new_data = self._data.astype(
-> 5882             dtype=dtype, copy=copy, errors=errors, **kwargs
   5883         )
   5884         return self._constructor(new_data).__finalize__(self)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582
    583     def convert(self, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                 kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641             # _astype_nansafe works fine with 1-d only
    642             vals1d = values.ravel()
--> 643             values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644
    645         # TODO(extension)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    648     # dispatch on extension dtype if needed
    649     if is_extension_array_dtype(dtype):
--> 650         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    651
    652     if not isinstance(dtype, np.dtype):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    321     @classmethod
    322     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 323         return integer_array(scalars, dtype=dtype, copy=copy)
    324
    325     @classmethod

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    105     TypeError if incompatible types
    106     """
--> 107     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    108     return IntegerArray(values, mask)
    109

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    190     ]:
    191         raise TypeError(
--> 192             "{} cannot be converted to an IntegerDtype".format(values.dtype)
    193         )
    194

TypeError: object cannot be converted to an IntegerDtype
```

Expected Output

The Series converted to the nullable integer dtype, i.e. the equivalent of pd.Series([123, 345, 456], dtype='Int64').

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.18.16-041816-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 0.25.1
numpy            : 1.14.3
pytz             : 2018.4
dateutil         : 2.7.3
pip              : 19.1.1
setuptools       : 39.1.0
Cython           : 0.28.2
pytest           : 3.5.1
hypothesis       : None
sphinx           : 1.7.4
blosc            : None
feather          : None
xlsxwriter       : 1.0.4
lxml.etree       : 4.2.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10
IPython          : 6.4.0
pandas_datareader: None
bs4              : 4.6.0
bottleneck       : 1.2.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.2.1
matplotlib       : 3.1.1
numexpr          : 2.6.5
odfpy            : None
openpyxl         : 2.5.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.1.0
sqlalchemy       : 1.2.7
tables           : 3.4.3
xarray           : None
xlrd             : 1.1.0
xlwt             : 1.3.0
xlsxwriter       : 1.0.4
TomAugspurger commented 4 years ago

I don't know if we want to support that automatically, since there's some ambiguity: converting 'NaN' to NA is probably fine, but what about something like 'some text'?

I'd recommend using pd.to_numeric to get numeric values, and converting to the nullable integer dtype after that.
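A sketch of that two-step conversion (it assumes the values survive the intermediate float64 round-trip; see the precision caveat discussed below):

```python
import pandas as pd

a = pd.Series(['123', None, '456'])

# Parse to numeric first (float64 because of the missing value),
# then cast to the nullable integer dtype:
result = pd.to_numeric(a).astype('Int64')
print(result.dtype)  # Int64
```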

mar-ses commented 4 years ago

The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate them and change the last few digits. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers that are long enough that floating-point conversion approximates and modifies them.

Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

nrebena commented 4 years ago

Some follow-up questions:

  • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
  • What about using a.astype(np.int64)?
  • Also, is there a difference between int64 and Int64 (other than capitalization)?
  • If we are talking about identifiers, isn't uint64 even better?

Please tell me if I really didn't understand the issue and am out of my depth.

jreback commented 4 years ago

> The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate them and change the last few digits. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers that are long enough that floating-point conversion approximates and modifies them.
>
> Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?

we don’t convert to float first

to_numeric is the workhorse

astype doesn't have any options, meaning all values must be convertible, like in numpy
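As an illustration of that constraint, a plain astype(int) fails as soon as one value isn't convertible (a minimal sketch; the exact exception message depends on the pandas/numpy version):

```python
import pandas as pd

s = pd.Series(['123', None, '456'])

try:
    s.astype(int)  # every value must be convertible, as with numpy
except (TypeError, ValueError) as exc:
    print(exc)
```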

mroeschke commented 4 years ago

Doesn't appear we have much appetite to support this. Thanks for the suggestion but we'd recommend using to_numeric first. Closing.

mar-ses commented 4 years ago

I must say I disagree, @mroeschke.

  1. Regarding the "appetite": I'm not sure how you measure that; there were people commenting and some likes.

  2. As I mentioned in my previous comment, to_numeric isn't sufficient. If there are any NaNs, it will first convert to float, which can silently lose information, since a float cannot hold all the bits of a 64-bit integer.

  3. I gave an example of a situation where this is a problem, namely GAIA identifiers in astronomy, though there are probably other use cases; in any case, this is quite simply a bug. The code in the opening post should work, yet it doesn't. I think something within astype simply wasn't updated to reflect the fact that pandas now supports the new Int64 dtype. If pandas doesn't work as expected, people using it will need to spend a lot of time figuring out why and how to get around it. And before you say this is not a common use case: GAIA is essentially the biggest astronomical survey to date. Its data is already being used extensively, and because this happens with the target/star identifiers, the issue will potentially affect almost everyone using that data who prefers pandas over astropy.Table.

I think this should be reopened.

mar-ses commented 4 years ago

> Some follow-up questions:
>
> • What is the dtype of a.astype(int)? On my system it is int64, but it may differ on yours. Is that part of the problem?
> • What about using a.astype(np.int64)?
> • Also, is there a difference between int64 and Int64 (other than capitalization)?
> • If we are talking about identifiers, isn't uint64 even better?
>
> Please tell me if I really didn't understand the issue and am out of my depth.

Sorry for the late reply. I think I explained my issue poorly.

On my system I also have int64 by default. However, the issue is that int64 cannot hold missing/NaN values. That's why I need to use Int64 (with the capital I), the new extension dtype that allows integer arrays to hold missing values:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

This is also why to_numeric doesn't work as it currently is: if it finds missing/NaN values - even if all the other values are integers - it will convert to float. That's alright unless you're dealing with something like very long 64-bit integers, where the float significand can't hold all the digits of the integer.

My current workaround is to convert to float128 first and then to Int64, but the above is simply a bug. There is no reason why .astype('Int64') shouldn't work, yet it produces the error above when converting from strings.
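To make the precision problem concrete, here is a minimal sketch (the identifiers are invented for illustration, not real GAIA IDs):

```python
import pandas as pd

# Two distinct 19-digit identifiers plus a missing entry.
s = pd.Series(['4472832130942575872', None, '4472832130942575873'])

# to_numeric falls back to float64 because of the missing value;
# float64 has only 53 bits of precision, so the two distinct IDs collide:
as_float = pd.to_numeric(s)
print(as_float.iloc[0] == as_float.iloc[2])  # True -- information lost

# The nullable Int64 dtype keeps full 64-bit precision and the missing value:
ids = pd.array([4472832130942575872, None, 4472832130942575873], dtype='Int64')
print(ids[0] == ids[2])  # False -- distinct identifiers preserved
```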

mar-ses commented 4 years ago

>> The issue is that with missing data, to_numeric will convert to float first, right? In that case, if you have very long integers as identifiers, converting to double precision will approximate them and change the last few digits. This is actually the problem I was dealing with and why I started looking into Int64. It's not as uncommon as it might seem: in astrophysics, for example, GAIA has identifiers that are long enough that floating-point conversion approximates and modifies them. Moreover, as far as I can see, shouldn't .astype('Int64') and .to_numeric handle these cases identically? I don't know the in-depth details of what .to_numeric does off the top of my head, but couldn't .astype('Int64') follow the same rules for ambiguous cases?
>
> we don't convert to float first
>
> to_numeric is the workhorse
>
> astype doesn't have any options, meaning all values must be convertible, like in numpy

What I mean is that to_numeric first converts to float if it detects missing values, and it doesn't seem to want to convert to Int64. Also, there is no reason astype shouldn't work here; the array of strings above can be converted to Int64.

jreback commented 4 years ago

@mar-ses if you'd like to contribute tests / a patch to .to_numeric that would be great; we would / should support nullable integer type conversion there

mar-ses commented 4 years ago

I've never contributed to these big projects before, and I assume I would need to understand the internals and the standard way these things are done inside pandas, so any recommendations on where to start reading, etc.?

Additionally, would it not also make sense to do it with .astype too? If this is done in to_numeric, would that be controlled by an argument, or would it have to figure out automatically that these are all ints with certain values missing? To me, it makes more sense with astype, since there you state directly what you want the final dtype to be, whereas to_numeric has to guess, right?

mar-ses commented 4 years ago

I'm looking into it; I wouldn't mind doing this then. Looking at to_numeric, I believe the change would be in `_libs.lib`, in the function `maybe_convert_numeric`, here:

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx

As far as I can figure out, if it can't immediately convert to a normal int, it will try to figure out what it can convert to on a value-by-value basis. It tries several possible types and makes an array for each, e.g. `complexes`, `floats`, `uints`, etc. Then it goes through the values, and if it finds a null, for example, it flags that a null was seen and puts the value into the `floats` and `complexes` arrays but not the `ints` array.

For this to work, it would also need a nullable integer array, but since this is done in Cython, is that even possible? Is there a version of this nullable integer array in Cython? Or otherwise, can the following object hold a NaN value: `ndarray[int64_t] ints = np.empty(n, dtype='i8')`?

Or should I create another array like:

ndarray[Int64_t] null_ints = np.empty(n, dtype='Int64') # or something similar?
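For what it's worth, here is a rough Python-level sketch of the idea being discussed (this is not the actual Cython code in pandas/_libs/lib.pyx, and the helper name is made up): track a boolean mask of missing entries alongside a plain int64 array, then build an IntegerArray from the two.

```python
import numpy as np
import pandas as pd

def parse_to_nullable_int(values):
    """Hypothetical helper: parse strings/None into a nullable Int64 array."""
    n = len(values)
    ints = np.zeros(n, dtype='int64')   # placeholder values where data is missing
    mask = np.zeros(n, dtype=bool)      # True marks a missing entry
    for i, v in enumerate(values):
        if v is None or (isinstance(v, float) and np.isnan(v)):
            mask[i] = True
        else:
            ints[i] = int(v)
    return pd.arrays.IntegerArray(ints, mask)

print(parse_to_nullable_int(['123', None, '456']))
```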

jreback commented 4 years ago

once this is merged (soon): https://github.com/pandas-dev/pandas/pull/27335

this will be relatively straightforward to patch

maresb commented 4 years ago

Bumping this issue now since https://github.com/pandas-dev/pandas/pull/27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

jreback commented 4 years ago

> Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?

I never understood what good bumping an issue in an open-source, all-volunteer project actually does

maresb commented 4 years ago

I hope I didn't commit a faux pas. Since the anticipated merge recently took place, patching this issue is no longer blocked. I was trying to be helpful by drawing attention to this fact as a "bump". Sorry if that came across as pushy/annoying. I'm newly active on GitHub and still figuring out the social norms.

mar-ses commented 4 years ago

I'm also a newbie here. I looked at that other issue a bit (the PR that's getting merged), and won't the update to maybe_convert_numeric fix the issue here too? I need to have a deeper look; sorry, I've been very busy.

jreback commented 4 years ago

@maresb there are 3000 open issues and this is an all-volunteer project, so issues get solved when folks contribute PRs

maresb commented 4 years ago

@jreback yes, that is so obvious that I'm surprised that you feel the need to point it out to me. I'd love to contribute, but it'll be several weeks before that's even possible.

In case you have a problem with my previous comment, I would appreciate some constructive feedback. I thought that I was being helpful and polite by alerting @mar-ses, since he previously expressed interest in contributing.