jorisvandenbossche opened 4 years ago
With a long delay, I went again through all the above comments, and I tried to make a somewhat fair / objective summary of the discussion. This can be found here: https://hackmd.io/@jorisvandenbossche/ry4roQNcI (but not yet fully polished).
What follows below is my current personal opinion:
My current preference is a mask-based implementation that in theory allows both NaN and NA (pd.NA) values. Mainly for the following reasons: with a mask-based approach, NaNs can always end up in the values when constructing a FloatArray(values, mask) from a values+mask combo (unless we would always check this, but this would be costly to always do). But it also keeps the option open to add checks to certain operations, to e.g. let 0/0 result in NA instead of NaN (and thus in practice only have NA).

I think 0/0 is a reasonable case to result in NA instead of NaN. If we do this, I think it will be basically impossible for a user to run into NaNs even though they are allowed in theory (unless you explicitly start with NaNs).

So starting with a mask-based FloatArray implementation should allow us to start experimenting with it and get more feedback, while it also keeps all options open regarding how to handle NaNs: we can continue discussing the exact NaN semantics and go with whichever option we eventually decide upon.
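To make the storage idea concrete, here is a toy numpy model of the values+mask layout (illustrative only, not the actual pandas internals): NA-ness lives in the mask, while NaN is an ordinary IEEE-754 value inside the values.

```python
import numpy as np

# Toy model of a mask-based float array: NA is tracked in `mask`,
# while NaN is just an ordinary float inside `values`.
values = np.array([1.0, np.nan, 3.0])
mask = np.array([False, False, True])  # only the last element is NA

is_na = mask.copy()                  # NA comes solely from the mask
is_nan = np.isnan(values) & ~mask    # a "real" NaN, distinct from NA

print(is_na.tolist())   # [False, False, True]
print(is_nan.tolist())  # [False, True, False]
```

This is why the mask-based layout keeps all options open: the same storage can either preserve the NaN or have operations flip the mask bit to turn it into NA.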
I know this is kind of postponing decisions, but: 1) I think if there is one thing clear from the above discussions, it is that there is not yet a clear conclusion, and 2) at least it can get us going with some practical implementations, and let us decide on one aspect of the discussion (the actual storage approach), while we can continue discussing the other aspect (NaN handling).
@jorisvandenbossche request for clarification on your previous comment: at what level do you think this should be implemented? (DataFrame/Series/Index/Block/EA/...)
All the above is about a new nullable float dtype, similar to the nullable integer and boolean dtypes. So this means a Float64Dtype and a Float64Array extension array (and float32, ...). And then this gets stored in DataFrame/Series/(Index)/Block just like the other nullable extension dtypes .. ?
(does that answer your question? I don't fully get what you are thinking about how this might be different than the nullable integer dtype we already have).
> does that answer your question?
It does, thanks.
I opened a PR (https://github.com/pandas-dev/pandas/pull/34307) for this first part I proposed above: a masked array-based FloatingArray extension array (for now, in that PR, postponing the NA/NaN decision).
Going back over this thread, I mostly agree with Joris's last comment.
1) We should not break things for users that are currently happy with the float64 behavior. I expect most users don't ever think about the distinction between np.nan vs pd.NA, and we shouldn't make them learn about it. Users who opt in to Float64Dtype can get the fancier behavior.
2) Users shouldn't have to know anything about limits or log branches to understand "is 0/0 missing or nan" or "is log(-1) missing or nan". Let's propagate pd.NA when we see it and otherwise use np.nan.
3) If a user does series.iloc[0] = np.nan, we should trust that they mean np.nan.
4) Changing .isna() for float64 would be a PITA for users (see point 1). Let's not do that. a) If a Float64Dtype Series contains np.nan and .isna says False for those entries, that affects .fillna and .dropna, where we generally expect to have no np.nans in the result. b) If we want to distinguish between pd.NA vs (pd.NA | np.nan) that calls for either a .ismissing or a keyword in .isna
5) The default behavior for __array__/to_numpy I'm on the fence about.
a) If there are no NAs present, then converting to float64 is a better user experience
b) Otherwise, it depends on if the user cares about pd.NA vs np.nan.
i) If they don't, then they should just be using float64 to begin with.
ii) If they do, then they could reasonably want these retained by to_numpy
c) Having to_numpy's result dtype depend on whether NAs are present is not desirable.
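On point 5, note that the existing to_numpy escape hatch already lets a caller who cares state what NA should become, sidestepping the default-dtype question (a small sketch; what the *default* conversion should do is exactly what is being debated):

```python
import numpy as np
import pandas as pd

arr = pd.array([1.0, pd.NA], dtype="Float64")

# Explicit conversion: the caller chooses the result dtype and the
# numpy value that pd.NA maps to.
out = arr.to_numpy(dtype="float64", na_value=np.nan)
print(out)        # [ 1. nan]
print(out.dtype)  # float64
```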
I agree here:

> b) If we want to distinguish between pd.NA vs (pd.NA | np.nan) that calls for either a .ismissing or a keyword in .isna

we would need to introduce additional behavior to differentiate between the nan/NA. Generally I think we would really need to see substantial demand for this feature, as it is likely a lot of work to implement and cover all of the edge cases.
> we would need to introduce additional behavior to differentiate between the nan/NA. Generally I think we would really need to see substantial demand for this feature, as it is likely a lot of work to implement and cover all of the edge cases.
I'm probably the one who has argued for differentiating between np.nan and pd.NA as "error in computation" versus "missing data". I've found this to be important in applications, where this differentiation helped find issues with underlying data or in a computation. For example:
>>> df = pd.DataFrame.from_records([[3.2, 1.6], [0.0, 0.0], [5.0, np.nan]], columns=["a", "b"]).convert_dtypes()
>>> df
a b
0 3.2 1.6
1 0.0 0.0
2 5.0 <NA>
>>> df.a/df.b
0 2.0
1 NaN
2 <NA>
dtype: Float64
Here it is useful to know that the first np.nan value was due to the values both being zero in the division, while the pd.NA value was due to one of the values missing. I've had projects where figuring this out (prior to the introduction of the floating extension dtypes) was a challenge, as there are two different errors in the source data. In one case, it is due to both values being zero (which was unexpected), and in the other case it was due to one of the values missing (which was OK).
So I think pandas should take a lead here in differentiating between "missing values" and "Not a number" and do what we can to make that differentiation clear.
> I'm probably the one who has argued for differentiating between np.nan and pd.NA as "error in computation" versus "missing data". I've found this to be important in applications, where this differentiation helped find issues with underlying data or in a computation.
To get a sense of the room here, can I get a show of hands: does having np.nan/pd.NA be semantically distinct matter to you? Use reaction emojis to get a tally. Rocket emoji for "I don't care", Eyes emoji for "I don't care, as long as my existing workflows aren't affected."
Back when we were discussing implementing pd.NA in the first place there were a few comments to the effect of "when i teach pandas, having more than one NA value is a pain point", but I don't see them in #28095. Am I misremembering?
A general comment on the notion of "We should not break things for users that are currently happy with the float64 behavior" or "Changing .isna() for float64 would be a PITA for users": it is not because we would distinguish between NA and NaN in the pd.Float64Dtype and stop treating NaN as missing value in isna() (and thus dropna, fillna, etc) by default, that this would change actual end-user behaviour when people switch from np.float64 to pd.Float64Dtype.
As long as we convert NaN to NA on input / in constructors etc. by default (what we currently do), users won't see a change in behaviour in practice for almost all usage. If you don't start with NaN in your data, the only ways that I am currently aware of to get NaNs through operations in pandas are 0/0 and np.log(negative_value). (there are probably other numpy operations that can also do that, it would be good to somehow list them)
> If you don't start with NaN in your data, the only ways that I am currently aware of how you can get NaNs through operations in pandas are 0/0 and np.log(negative_value).
Another is np.sqrt(negative_value). The one that has hit me and my staff was the 0/0 case, where we had two series with mixes of numbers, zero values, and missing values. So when we did division of the series, it would have been very useful to know which values in the result were NA versus np.nan, so we could differentiate doing the division with missing values from the 0/0 case.
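For reference, a short (non-exhaustive) sketch of numpy operations that produce NaN from non-NaN inputs; inf - inf is an additional case beyond the ones mentioned above:

```python
import numpy as np

# Operations that yield NaN without any NaN in the input.
with np.errstate(divide="ignore", invalid="ignore"):
    nan_makers = {
        "0/0": np.float64(0.0) / np.float64(0.0),
        "log(-1)": np.log(-1.0),
        "sqrt(-1)": np.sqrt(-1.0),
        "inf - inf": np.inf - np.inf,
    }

for name, result in nan_makers.items():
    print(name, "->", result)  # each prints nan
```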
I came to this issue because I have a problem that crosses this issue. I have a pd.Series with a pint array dtype. Its values include a pd.NA value. When I attempt s.astype('pint[units]') the NA value cannot be converted to a float. But if the value is NaN, of course that is no problem. I can build a Pint Quantity with pd.NA no problem, it's just the astype insists on converting values to floats, and that throws a TypeError. I believe that if this issue is solved, its solution will guide the solution to my problem.
EDIT: note from @jorisvandenbossche: I hid this and the subsequent comments, as I think the question itself is resolved now, and the rest is slightly off-topic for the discussion here (the discussion is already very long, so hiding those comments makes it a bit easier to see/read the full discussion)
> I came to this issue because I have a problem that crosses this issue. I have a pd.Series with a pint array dtype. Its values include a pd.NA value. When I attempt s.astype('pint[units]') the NA value cannot be converted to a float. But if the value is NaN, of course that is no problem. I can build a Pint Quantity with pd.NA no problem, it's just the astype insists on converting values to floats, and that throws a TypeError. I believe that if this issue is solved, its solution will guide the solution to my problem.
This may have to do with the implementation of the "pint" array dtype, which is not part of pandas, but something someone else implemented. Can you point to that?
> This may have to do with the implementation of the "pint" array dtype, which is not part of pandas, but something someone else implemented. Can you point to that?
I think this is an issue with the implementation of astype in that repo. For example, here is a pandas sample, where I create an "Int64" array which has pd.NA in it, and then convert it to float, which converts the pd.NA to np.nan:
>>> import pandas as pd
>>> s=pd.Series([1,pd.NA,3], dtype="Int64")
>>> s
0 1
1 <NA>
2 3
dtype: Int64
>>> s.astype(float)
0 1.0
1 NaN
2 3.0
dtype: float64
The conversion in the pint case and in the pandas case is delegated to the parent array, so in your case it would be the pint implementation.
As I commented here on the pint-pandas issues queue, I agree that this is likely due to imperfect support of pd.NA:
https://github.com/hgrecco/pint-pandas/issues/88#issuecomment-1003934217
Alas, the issue that prompted this comment has been resolved, so now I need to check if this should become its own full-fledged issue for pint or pint-pandas.
With the current implementation of FloatArray, we do have this distinction in a simple use case:
>>> s = pd.Series(pd.array([1.0, 0.0, pd.NA, 2.0, np.nan], dtype="Float64"))
>>> s
0 1.0
1 0.0
2 <NA>
3 2.0
4 <NA>
dtype: Float64
>>> t = pd.Series(pd.array([0.0, 0.0, 3.0, 4.0, pd.NA], dtype="Float64"))
>>> t
0 0.0
1 0.0
2 3.0
3 4.0
4 <NA>
dtype: Float64
>>> q = s/t
>>> q
0 inf
1 NaN
2 <NA>
3 0.5
4 <NA>
dtype: Float64
>>> pd.isna(q)
0 False
1 False
2 True
3 False
4 True
dtype: bool
>>> q.sum()
nan
>>> f=q.astype(float)
>>> f
0 inf
1 NaN
2 NaN
3 0.5
4 NaN
dtype: float64
>>> f.sum()
inf
>>> f.iloc[1:].sum()
0.5
Doing the 0/0 calculation forces the np.nan value into the FloatArray. Not sure what would happen if that is then used in a groupby computation, although from the example above with q.sum(), I am guessing that the np.nan values get passed to cython, and it then does the sum and returns np.nan as the result.
> Not sure what would happen if that is then used in a groupby computation, although from the example above with q.sum(), I am guessing that the np.nan values get passed to cython, and it then does the sum and returns np.nan as the result.
The groupby is an example of where this does not yet work out of the box. This is because it is not using numpy functions (which would propagate NaNs), but using a custom cython implementation which currently treats NaNs as missing values, and thus the NaN gets skipped by default:
>>> arr = pd.arrays.FloatingArray(np.array([np.nan, 1, 1, 1]), np.array([False, False, False, False]))
>>> arr
<FloatingArray>
[nan, 1.0, 1.0, 1.0]
Length: 4, dtype: Float64
>>> pd.Series(arr).groupby([0, 1, 0, 1]).sum()
0 1.0
1 2.0
dtype: Float64
The presence of np.inf we actually also don't really treat correctly for the plain numpy.float64 dtype in the groupby operation:
>>> pd.Series([np.inf, 1, 1, 1]).groupby([0, 1, 0, 1]).sum()
0 NaN
1 2.0
dtype: float64
# setting this option has no influence
>>> pd.options.mode.use_inf_as_na = True
>>> pd.Series([np.inf, 1, 1, 1]).groupby([0, 1, 0, 1]).sum()
0 NaN
1 2.0
dtype: float64
So this results in NaN in the result, while I suppose that the sum of [inf, 1] should be inf ?
So if we decide to go with distinguishing NA and NaNs as separate concepts, we will need to update the groupby algos for masked input to no longer special case NaN as missing.
One concrete example that @toobaz brought up above (https://github.com/pandas-dev/pandas/issues/32265#issuecomment-592959323) is the case of an empty mean giving NA but empty sum giving 0 (and thus potentially NaN). Showing this with a dummy example with a categorical with unobserved groups (but I think you can get the same with eg a resample operation where a period has no data):
# the example data
>>> df = pd.DataFrame({"key": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']),
... "values": pd.array([1, 2, 3], dtype="Float64")})
>>> df
key values
0 a 1.0
1 b 2.0
2 a 3.0
>>> df.dtypes
key category
values Float64
dtype: object
# Getting the sum gives 0 for the empty group, and thus dividing that with the size of the groups gives NaN
>>> df.groupby('key')['values'].sum()
key
a 4.0
b 2.0
c 0.0
Name: values, dtype: Float64
>>> df.groupby('key')['values'].sum() / df.groupby('key')['values'].size()
key
a 2.0
b 2.0
c NaN
Name: values, dtype: Float64
# but calculating the same with mean() gives NA instead of NaN
>>> df.groupby('key')['values'].mean()
key
a 2.0
b 2.0
c <NA>
Name: values, dtype: Float64
(in this specific case you can of course always use mean(), but there might be other situations where you want to do this manually for some reason)
Is it unfortunate that those two operations would give a different result (with the current behaviour of distinguishing NA and NaN) ?
> So if we decide to go with distinguishing NA and NaNs as separate concepts, we will need to update the groupby algos for masked input to no longer special case NaN as missing.

Depending on if you're using a libgroupby function that has been adapted to use mask: if it sees that mask[i] is False, it won't do any checking resembling isnan(values[i]).
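A pure-Python sketch of such a mask-aware kernel (illustrative, not the actual cython code): NA-ness comes only from the mask, so a NaN in the values propagates into the group sum instead of being skipped.

```python
import numpy as np

def masked_group_sum(values, mask, labels, ngroups):
    # Sketch of a masked groupby-sum kernel: mask[i] marks NA and is
    # skipped; NaN in `values` is treated as an ordinary float, so it
    # propagates into the group result (no isnan(values[i]) check).
    out = np.zeros(ngroups)
    for i in range(len(values)):
        if mask[i]:
            continue
        out[labels[i]] += values[i]
    return out

values = np.array([np.nan, 1.0, 1.0, 1.0])
mask = np.array([False, False, False, False])
print(masked_group_sum(values, mask, [0, 1, 0, 1], 2))  # [nan  2.]
```

With this kernel, the FloatingArray example above would return nan for group 0 rather than silently skipping the NaN.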
User-wise, my only strong opinion remains that it should be essentially impossible to introduce a pd.NA except via I/O.
Implementation-wise, I agree with the arguments @jreback made on last week's call that this represents a massive API change and maintenance burden.
> User-wise, my only strong opinion remains that it should be essentially impossible to introduce a pd.NA except via I/O.
We don't have that behavior now - see below.
>>> import pandas as pd
>>> s=pd.Series([1.2, 0.0, 5.6]).convert_dtypes()
>>> s.shift(1)
0 <NA>
1 1.2
2 0.0
dtype: Float64
>>> sf=pd.Series([1.2, 0.0, 5.6])
>>> sf
0 1.2
1 0.0
2 5.6
dtype: float64
>>> sf.shift(1)
0 NaN
1 1.2
2 0.0
dtype: float64
>>> s.shift(1)/[3.0, 2.0, 0.0]
0 <NA>
1 0.6
2 NaN
dtype: Float64
>>> sf.shift(1)/[3.0, 2.0, 0.0]
0 NaN
1 0.6
2 NaN
In this particular case, the differentiation of np.nan and pd.NA is important from the user perspective in the first division result. A member of my team had the second result, and it was due to a bug, but it was hard to find because the way that np.nan was introduced was different in the 2 cases. One np.nan occurred because of shift (or truly missing data), and the second occurred due to a bug (division by 0).
Thank you, .shift is a good example where my heuristic needs to be more fully fleshed out.
I've finally read through the whole discussion
What I have most difficulty understanding is why .isna() would return False for np.nan.

- If someone does df[~df.isna().any(axis=1)] and then found nans still there, it'd be very confusing. nans from mathematical operations aren't unusual. This'd be a huge breaking change;
- cuDF allows both NA and nan, but their .isna() returns True for np.nan. I think the cuDF behaviour is less surprising:
  - if you construct with np.nan, or from a source with missing values, then nan becomes NA
  - if users introduce nan themselves, they keep it as such
  - isna() returns True for nan

Though my top preference would be to just have a single missing value (NA).
I'm also bothered by the mean() vs sum()/size() discrepancy, and I wouldn't be keen on intentionally introducing such an edge case.
TL;DR my position is:

- a single missing value (NA)
- if NA and nan were allowed, then isna() would return True for nan

I've had a change of heart here. OK to have both nan and NA, so long as there's an easy way to fill nans with NA.
Another point I'm struggling with is

> The default behavior for array/to_numpy I'm on the fence about. a) If there are no NAs present, then converting to float64 is a better user experience

In this comment, it was pointed out that Int64 converted to numpy should be of dtype object, and that this shouldn't depend on whether or not there are missing values in the Series.
I'd find it a bit odd then to have
In [14]: pd.Series([1, 2, 3], dtype='Int64').to_numpy()
Out[14]: array([1, 2, 3], dtype=object)
In [15]: pd.Series([1, 2, 3], dtype='Float64').to_numpy()
Out[15]: array([1., 2., 3.])
Currently they both get turned to dtype=object, but it sounds like one of the suggestions in this thread is for Float64 to get converted to float64.
xref https://github.com/lebigot/uncertainties/issues/169
For the above, it would be great if pd.isna() could return True for both np.nan and uncertainties with np.nan as a nominal value (or perhaps also error term).
I just got bitten by this problem. I understand that the question of whether to differentiate the two values or treat them the same / force a FloatingArray to only have pd.NA is a complex one, but it seems like a lack of movement in either direction is worse than either option. Right now, we can end up with NaNs without realizing it, and then bam, the tried and true method we always used to remove invalid data like isnull()/isna() will silently lead to downstream errors.
I basically second Marco's comment - if NaNs are allowed there should be a good way to handle them with the pandas APIs for missing data. It's very confusing to a user that isna detects the same value, np.NaN, if the array happens to be float64 etc. but not Float64. In fact, the docs explicitly say it detects NaN, so isn't this a bug? Or if there isn't agreement on that, can parameters be added to these functions to enable handling both values? That's what pyarrow does.
Would it be feasible to make progress on this in the 2.0 major release?
> Would it be feasible to make progress on this in the 2.0 major release?
Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes.
> Would it be feasible to make progress on this in the 2.0 major release?

> Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes.
OK, cool. @jbrockmendel in the interim would people be open to a docs update to explain the current pitfalls (eg, the isna bug) to users?
Fine by me. @jorisvandenbossche might want to weigh in on the wording.
TBH I've had a change of heart here - OK with having both, so long as there's a simple way of converting nan to NA. I don't think there currently is one, right? e.g.
In [24]: ser
Out[24]:
0 NaN
1 1.0
2 1.0
3 <NA>
dtype: Float64
In [25]: ser.fillna(None)
doesn't work, it raises ValueError: Must specify a fill 'value' or 'method'.
> I don't think there currently is one, right? e.g.
Yes, there is:
>>> ser
0 NaN
1 1
2 1
3 <NA>
dtype: object
>>> ser.fillna(pd.NA)
0 <NA>
1 1
2 1
3 <NA>
dtype: object
wait, that's not Float64 😄
In [37]: ser = pd.Series([0, 1, 2, None], dtype='Float64')
In [38]: ser /= ser
In [39]: ser
Out[39]:
0 NaN
1 1.0
2 1.0
3 <NA>
dtype: Float64
In [40]: ser.fillna(pd.NA)
Out[40]:
0 NaN
1 1.0
2 1.0
3 <NA>
dtype: Float64
Kind of hacky, but this works:
>>> ser = pd.Series([0, 1, 2, None], dtype='Float64')
>>> s0 = ser/ser
>>> s0
0 NaN
1 1.0
2 1.0
3 <NA>
dtype: Float64
>>> s0.mask(s0.astype(float).isna(), pd.NA)
0 <NA>
1 1.0
2 1.0
3 <NA>
dtype: Float64
I think we may need to deprecate fillna in favour of fillnan (only fills nan) and fillnull (fills missing values), and then allow
In [40]: ser.fillnan(pd.NA)
Out[40]:
0 <NA>
1 1.0
2 1.0
3 <NA>
Likewise for:

- isna -> isnull & isnan
- dropna -> dropnull & dropnan
- notna -> notnull & notnan

Or, alternatively, add a keyword argument to tell whether the above should apply to 'missing' or 'nan' values.
-0.9 on adding more api like that. Do we have third and fourth versions for None and NaT?
I'd count both of those as 'missing' - I've just edited the comment though, noting that a keyword argument to tell what counts as 'na' would also be fine. Patrick's pointed out that Arrow has something similar https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null
I support the keyword idea like what Arrow has done
id be fine with a keyword, though i think we should have the default treat nan as na to keep behavior unchanged for users of non-masked dtypes.
I am -1 on having nan and NA, and in fact I think we should actually do a PDEP on pd.NA entirely - we didn't originally.
What do you think the expected output of
In [4]: ser = pd.Series([0, 1, 2, None], dtype='float64[pyarrow]')
In [5]: ser /= ser
In [6]: ser
Out[6]:
0 NaN
1 1.0
2 1.0
3 <NA>
dtype: double[pyarrow]
should be? That the first row 'NaN' be '<NA>'? And likewise for the 'Float64' dtype?
No objections, that would be simpler to be fair, but it would mean making sure that in operations which might introduce 'NaN', '<NA>' is substituted.
How about keeping isna / fillna as they are, but just adding the nan_is_null keyword as suggested above?
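A sketch of what that keyword could look like on the values+mask model (the nan_is_null name and behaviour here are the suggestion under discussion, not an existing pandas API):

```python
import numpy as np

def isna_sketch(values, mask, nan_is_null=True):
    # Hypothetical nan_is_null keyword: NA always comes from the mask;
    # when nan_is_null=True, floating-point NaN also counts as "na".
    values = np.asarray(values, dtype="float64")
    mask = np.asarray(mask, dtype=bool)
    if nan_is_null:
        return mask | np.isnan(values)
    return mask.copy()

vals = [np.nan, 1.0, 0.0]
m = [False, False, True]
print(isna_sketch(vals, m).tolist())                     # [True, False, True]
print(isna_sketch(vals, m, nan_is_null=False).tolist())  # [False, False, True]
```

The default of True would keep behaviour unchanged for users who treat NaN as missing, while False gives the strict NA-only view.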
> Upon construction, when data is coming from Python/numpy, we convert any NaN into NA by default (long term, we could provide an option to turn this off if we decide to keep allowing NaNs)
If we're serious about allowing both, then I think that nan should not be converted to NA in the constructor. Otherwise, if you want to construct a Series like [1., 2., nan] with 'Float64' dtype, then you have to do something like
ser = pd.Series([1., 2., 0.], dtype='Float64')
other = pd.Series([1., 1., 0.], dtype='Float64')
ser = ser/other
If we don't want both nan and NA, then OK, but we need to make sure that every operation which introduces nan (like 0/0) has the nan replaced with NA, otherwise having isna behave differently for nullable vs non-nullable dtypes is going to be extremely confusing.
Anyway, this will be a topic for discussion at the in-person sprint, hopefully we can make some progress there
Just summarising my thoughts and position ahead of further discussions on this
For both Float64 and pyarrow-backed Float64, I think there are three main details to iron out:

- what should pd.Series([1., float('nan'), None], dtype='Float64') return?
- what should ser = pd.Series([0., 0.], dtype='Float64'); ser/ser return?
- which values should isna recognise?

I'd suggest one of the following two options:
Option 1: only NA; NaN is converted to NA.

In the above examples:

- [1., NA, NA]
- [NA, NA]

No need to add extra flags to isna. For non-nullable, it recognises NaN; for nullable, it recognises NA.
Option 2: distinguish NaN and NA. NaN results from invalid mathematical operations, and is preserved under construction.

In the above examples:

- [1., nan, NA]
- [nan, nan]

isna would then need an extra nan_is_null flag which:

- if False, would only recognise NA
- if True, would recognise both NA and NaN (and this would probably be a good default)

Cases when data is truly missing (e.g. shift) would be NA in both cases.
I'd have a slight preference for the latter, as it'd be more consistent with other tools. But I'd be fine with the former.
What I don't really like is the mixture present in the current behaviour, where NaN is allowed in the array but there is no option for isna / fillna to treat nan as NA.
If I am unable to participate in further discussion in this issue, I agree with Option 2: distinguish NaN and NA (especially with the myriad of xref issues here).
I strongly prefer option 2. I've had times where it was hard to tell if a computational error was due to missing data or something like a division by 0.
Probably worth specifying what happens in operations like Series.shift() or DataFrame.rolling().sum() that can insert na-like values, in terms of whether they would insert np.nan versus pd.NA.
thanks - that would be pd.NA; nan would only be for "illegal" mathematical operations
What happens when astyping from numpy float with nan? Preserving as well?
as in, pd.Series([1, float('nan')]).astype('Float64[pyarrow]')? If so, yes
> If we're serious about allowing both, then I think that nan should not be converted to NA in the constructor. Otherwise, if you want to construct a Series like [1., 2., nan] with 'Float64' dtype, then you have to do something like
>
> ser = pd.Series([1., 2., 0.], dtype='Float64')
> other = pd.Series([1., 1., 0.], dtype='Float64')
> ser = ser/other
It's mentioned somewhere hidden in the long thread above, but if we would keep this conversion, we could (and probably should) add a keyword here as well to control this, so you could do something like:

pd.Series([1., 2., np.nan], dtype="Float64", nan_as_null=False)

for an easier construction of a Series with a NaN. (pyarrow essentially has this keyword as well in pa.array(..), under the name "from_pandas", where if you pass a pandas object, it treats the NaN as null)
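On the values+mask model, the proposed keyword might behave like this (nan_as_null is the hypothetical keyword from this comment, modelled on pyarrow's from_pandas behaviour):

```python
import numpy as np

def construct_sketch(data, nan_as_null=True):
    # Hypothetical nan_as_null constructor keyword: by default NaN on
    # input becomes NA (tracked in the mask); with nan_as_null=False
    # the NaN is kept as a genuine NaN value.
    values = np.asarray(data, dtype="float64")
    if nan_as_null:
        mask = np.isnan(values)
    else:
        mask = np.zeros(len(values), dtype=bool)
    return values, mask

_, m_default = construct_sketch([1.0, np.nan])
_, m_keep = construct_sketch([1.0, np.nan], nan_as_null=False)
print(m_default.tolist())  # [False, True]
print(m_keep.tolist())     # [False, False]
```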
> [one of the three questions to answer] which should isna recognise?
I assume so, but just to be very explicit: whatever you say about isna then also applies to notna, dropna, fillna, .. right?

So essentially the question is: should our missing-value-handling methods consider NaN as missing or not?
There might be a 4th question to answer: places where we have skipna logic, should those consider NaN as missing or not?
Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic of pd.NA vs np.nan was raised several times. And also in the recent pandas-dev mailing list discussion on pandas 2.0 it came up (both in the context of np.nan for float and pd.NaT for datetime-like).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

- Keep np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
- Use pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but we convert this to pd.NA towards the user, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing" or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferable)
Actual discussion items: assume we are going to add floating dtypes that use pd.NA as missing value indicator. Then the following question comes up: it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask), but we also need to decide if we want this.

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:
vs
So I think those two describe nicely the two options we have on the question: do we want both pd.NA and np.nan in a float dtype and have them signify different things? -> 1) Yes, we can have both, versus 2) No, towards the user we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0)?

A dummy example showing how both can occur:
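The example snippet itself did not survive in this copy of the thread; a sketch consistent with the surrounding description (NaN introduced by a 0/0 computation, NA propagated from the missing input) might look like:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, None], dtype="Float64")
t = pd.Series([0.0, 2.0, 2.0], dtype="Float64")
res = s / t
print(res)
# 0    NaN   <- produced by the 0/0 computation
# 1    0.5
# 2    <NA>  <- propagated from the missing input value
```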
The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).
So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna / notna / dropna / fillna? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, or at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for both NaN in the values and the mask (which can also have performance implications).
Some other various considerations:
Having both pd.NA and NaN (np.nan) might actually be more confusing for users.
If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)
How do we handle compatibility with numpy? The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have a to_numpy(.., na_value=np.nan) explicit conversion. But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying. For conversion to numpy, see also some relevant discussion in https://github.com/pandas-dev/pandas/issues/30038
What with conversion / inference on input? E.g. creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))): do we convert NaNs to NA automatically by default?

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton