Closed mar-ses closed 4 years ago
I don't know if we want to support that automatically, since there's some ambiguity: Converting 'NaN'
to NA
is probably fine. But what about something like 'some text'
.
I'd recommend using pd.to_numeric
to get numeric values, and converting to nullable integer after that.
The issue is that with missing data, to_numeric
will convert to float
first right? In which case, if you're talking about having very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking into Int64
. It's not as uncommon as it might seem. For example in astrophysics, GAIA has identifiers that are long enough that floating point conversion approximates and modifies them.
Moreover, as far as I can see, shouldn't .astype('Int64')
and .to_numeric
handle cases identically really? I mean I don't know the in-depth details of what .to_numeric
does off the top of my head, but couldn't you make .astype('Int64')
follow the same rules regarding ambiguous cases?
Some follow up question:
a.astype(np.int64)
?Please tell me if I really did'nt understand the issue and am out of my depth.
The issue is that with missing data,
to_numeric
will convert tofloat
first right? In which case, if you're talking about having very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking intoInt64
. It's not as uncommon as it might seem. For example in astrophysics, GAIA has identifiers that are long enough that floating point conversion approximates and modifies them.Moreover, as far as I can see, shouldn't
.astype('Int64')
and.to_numeric
handle cases identically really? I mean I don't know the in-depth details of what.to_numeric
does off the top of my head, but couldn't you make.astype('Int64')
follow the same rules regarding ambiguous cases?
we don’t convert to float first
to_numeric is the workhorse
astype doesn’t have any options meaning all values must be convertible like in numpy
Doesn't appear we have much appetite to support this. Thanks for the suggestion but we'd recommend using to_numeric
first. Closing.
I must say I disagree on both points @mroeschke.
Regarding the "appetite" I'm not sure how you measure that, there were people commenting and some likes.
As I mentioned in my previous comment, to_numeric
isn't sufficient. If there are any NaNs, it will first convert to float. That can potentially erase information, as a float cannot hold all the bits of a 64-bit integer.
I gave an example of a situation where this is a problem, namely GAIA identifiers in astronomy, though there are probably other use cases, and in any case this is quite simply a bug. The code in the opening post should work, yet it doesn't. I think something within astype
simply wasn't updated yet to reflect the fact that pandas now supports the new Int64
datatype. If pandas doesn't work as expected, people using it will need to spend a lot of time figuring out why and how to get around it. And before you say this is not a common use case, GAIA is essentially the biggest astronomical survey to date. Its data will be used extensively and is already being used, and the fact that this happens with the target/star identifiers means this issue will potentially affect almost everyone using that data that prefers pandas
over the astropy.Table
.
I think this should be reopened.
Some follow up question:
- What is the type on a.astype(int). On my system it is int64, but it may differs on yours. Is that part of the problem?
- What about using
a.astype(np.int64)
?- Also, is there a difference between int64 and Int64 (outside capitalization?)
- If we are talking about identifiers, isn't uint64 even better?
Please tell me if I really did'nt understand the issue and am out of my depth.
Sorry for the late reply. I think I explained my issue poorly.
On my system I also have int64 by default. However, the issue is that int64 cannot hold missing/NaN values. That's why I need to use IntD64 (with the capital I), the new data type that allows int
arrays to hold missing/NaN values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
This is also why to_numeric
doesn't work as it currently is; if it finds missing/NaN values - even if all the other values are int
- it will convert to float. That's alright unless you're dealing with something like very long 64-bit integers, where the float significand can't hold all the digits of the integer.
My current workaround is to convert to float128
first, then to Int64
, but the above is simply a bug. There is no reason why .astype('Int64')
shouldn't work, yet it produces the error above when it tries to convert from strings.
The issue is that with missing data,
to_numeric
will convert tofloat
first right? In which case, if you're talking about having very long integers as identifiers, converting to double precision will approximate and change the last few digits of the identifier. This is actually the problem I was dealing with and why I started looking intoInt64
. It's not as uncommon as it might seem. For example in astrophysics, GAIA has identifiers that are long enough that floating point conversion approximates and modifies them. Moreover, as far as I can see, shouldn't.astype('Int64')
and.to_numeric
handle cases identically really? I mean I don't know the in-depth details of what.to_numeric
does off the top of my head, but couldn't you make.astype('Int64')
follow the same rules regarding ambiguous cases?we don’t convert to float first
to_numeric is the workhorse
astype doesn’t have any options meaning all values must be convertible like in numpy
What I mean is to_numeric
first converts to float
if it detects missing values, and it doesn't seem to want to convert ti Int64
. Also, there is no reason astype
shouldn't work here, the array of strings above can be converted to Int64
.
@mar-ses if you like to contribute tests / patch to .to_numeric that would be greatness; we would / should support nullable integer type conversion there
I've never contributed to these big projects, and I assume I would need to understand the internals and the standard way these things are done inside pandas, so any recommendations on where to start reading etc...?
Additionall, would it not also make sense to do it with .astype
too? Because if this is done in to_numeric
, would that be with an argument, or would it have to automatically figure out that these are all ints with certain values missing? To me, it feels like it makes more sense with astype
since there you directly what you want the final dtype to be, whereas to_numeric
has to guess right?
I'm looking into it, wouldn't minded doing this then. So looking at to_numeric
I believe the change would be in __lib.lib
in the function maybe_convert_numeric
here:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx
As far as I can figure out, if it can't immediately convert to the normal int
, it will then try to figure out what it can convert to on a value-by-value basis. So it try several possible types and makes an array for each, e.g. complexes
, floats
, 'uintsetc... Then it goes through the values and if it finds a null for example, it flags that a null was seen, and puts the values into the
floatsand
complexesarrays but not the
ints` array.
For this to work then, it would also need to have a nullable integer array, but since this is done in cython, is that even possible? Is there a version of this nullable integer array in cython? Or otherwise, can the following object hold a NaN
value: ndarray[int64_t] ints = np.empty(n, dtype='i8')
?
Or should I create another array like:
ndarray[Int64_t] null_ints = np.empty(n, dtype='Int64') # or someting similar?
once this is merged (soon); https://github.com/pandas-dev/pandas/pull/27335
this will relatively straightforward to patch
Bumping this issue now since https://github.com/pandas-dev/pandas/pull/27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?
Bumping this issue now since #27335 has been merged. I'd really like to see this, but I personally don't have time at the moment. @mar-ses, are you still up for looking into this?
i never understood what good bumping an issue in an open source all volunteer project actually means
I hope I didn't commit a faux pas. Since the anticipated merge recently took place, patching this issue is no longer blocked. I was trying to be helpful by drawing attention to this fact as a "bump". Sorry if that came across as pushy/annoying. I'm newly active on GitHub and still figuring out the social norms.
I'm also newby here. So I looked at this other issue a bit (the thing that's getting merged), and won't the update to maybe_convert_numeric
fix the issue here too? I need to have a deeper look, sorry been very busy.
@maresb there are 3000 issues and all volunteer so issues get solved when folks contribute PRs
@jreback yes, that is so obvious that I'm surprised that you feel the need to point it out to me. I'd love to contribute, but it'll be several weeks before that's even possible.
In case you have a problem with my previous comment, I would appreciate some constructive feedback. I thought that I was being helpful and polite by alerting @mar-ses, since he previously expressed interest in contributing.
Edited to add information.
Code Sample, a copy-pastable example if possible
Problem description
Currently, the conversion of object dtypes (containing strings) to Int64 doesn't work, even though it should be able to. It produces a long error (see at the end).
Important to note: the above is trying to convert to
Int64
with the capital I. Those are the new nullable-integer arrays that got added topython
.pandas
seems to support them, yet I think something insideastype
wasn't update to reflect that.In essence, the above should work; there is no reason why it should fail and it's quite simply a bug (in answer to some comments). Moreover,
to_numeric
is not a sufficient replacement here; it doesn't convert toInt64
when there are missing datatypes, instead it converts tofloat
automatically (this is actually a non-trivial problem when dealing with long integer identifiers, such as GAIA target identifiers).Traceback:
Expected Output
Output of
pd.show_versions()