API: distinguish NA vs NaN in floating dtypes

jorisvandenbossche commented 4 years ago

Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic about pd.NA vs np.nan was raised several times. And also in the recent pandas-dev mailing list discussion on pandas 2.0 it came up (both in context of np.nan for float as pd.NaT for datetime-like).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but we convert this to pd.NA towards the user, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing" or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferrable)

Actual discussion items: assume we are going to add floating dtypes that use pd.NA as missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.

This was touchec upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two describe nicely the two options we have on the question do we want both pd.NA and np.nan in a float dtype and have them signify different things? -> 1) Yes, we can have both, versus 2) No, towards the user, we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>>  pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But, it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna ? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, or at least initially. But, that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmeric / comparison / scalar ops, it would behave as NaN and not as NA (so eg comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for both NaN in the values as the mask (which can also have performance implications).

Some other various considerations:

Having both pd.NA and NaN (np.nan) might actually be more confusing for users.
If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)
How do we handle compatibility with numpy? The solution that we have come up (for now) for the other nullable dtypes is to convert to object dtype by default, and have a to_numpy(.., na_value=np.nan) explicit conversion. But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying.

For conversion to numpy, see also some relevant discussion in https://github.com/pandas-dev/pandas/issues/30038
What with conversion / inference on input? Eg creating a Series from a float numpy array with NaNs (pdSeries(np.array([0.1, np.nan]))) Do we convert NaNs to NA automatically by default?

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

jorisvandenbossche commented 4 years ago

With a long delay, I went again through all the above comments, and I tried to make a somewhat fair / objective summary of the discussion. This can be found here: https://hackmd.io/@jorisvandenbossche/ry4roQNcI (but not yet fully polished).

What follows below is my current personal opinion:

Short term, I would propose to use a masked array appraoch, using the mask to indicate missing (pd.NA) values. Mainly for the following reasons:
- I think we should take an approach that is maximally compatible with Apache Arrow. Giving the growing position of Apache Arrow as the standard in-memory format for tabular data, I think modelling a new, experimental dtype after Apache Arrow is the "Right Thing (tm)" to do (but I am clearly biased here given my work situation ..).
- A mask based appraoch is in line with the other nullable dtypes, like int/boolean, creating internal consistency (using the same mechanism to store NAs across dtypes) and ensures maximal code sharing for the nullable dtypes.
Moreover, this is also the most flexible solution for whatever NaN behaviour we want to end up with. In theory, you can still have a "FloatArray" where there are NaNs in the values at places not marked as missing in the mask. For example when manually constructing such a FloatArray(values, mask) from a values+mask combo (unless we would always check this, but this would be costly to always do). But it also keeps the option open to add checks to certain operations to eg let 0/0 result in NA instead of NaN (and thus in practice only have NA).
I think for an initial implementation allowing both NA/NaN in theory is probably "good enough" to be able to experiment with the new dtype and get more feedback on the different aspects of the discussion:
- Upon construction, when data is coming from Python/numpy, we convert any NaN into NA by default (long term, we could provide an option to turn this off if we decide to keep allowing NaNs). This will in practice mean that there are almost no cases where the user starts with NaNs in its data unexpectedly, except if it explicitly asked for it (or unless it uses data that knows the distinction, such as Parquet).
- Initially, I would not introduce special NaN-checking behaviour, and we just let np.nan, if present, be handled by numpy (e.g. in comparisons, NaN will keep giving False, while NA gives NA. Or in reductions it will propagate, etc). This actually gives the simplest implementation.
- For cases where new NaNs would be introduced (e.g. 0/0), we can further discuss (on a case by case basis, and based on user feedback) whether we want to special case to get NA instead of NaN. For example, given the discussions above, I think division with 0/0 is a reasonable case to result in NA instead of NaN. If we do this, I think it will be basically impossible for a user to run into NaNs even though they are allowed in theory (unless you explicitly start with NaNs).

So starting with a mask-based FloatArray implementation should allow us to start experimenting with it and get more feedback, while it also keeps all options open regarding how to handle NaNs: we can continue discussing the exact NaN semantics and go with whichever option we eventually decide upon.

I know this is kind of postponing decisions, but: 1) I think if there is one thing clear from the above discussions, it is that there is not yet a clear conclusion and 2) at least it can get us going with some practical implementations, and let us decide on one aspect of the discussion (the actual storage approach), while we can continue discuss the other aspect (NaN handling).

jbrockmendel commented 4 years ago

@jorisvandenbossche request for clarification on your previous comment: at what level do you think this should be implemented? (DataFrame/Series/Index/Block/EA/...)

jorisvandenbossche commented 4 years ago

All the above is about a new nullable float dtype, similar as the nullable integer and boolean dtypes. So this means a Float64Dtype and a Float64Array extension array (and float32, ..). And then this gets stored in DataFrame/Series/(Index)/Block just like the other nullable extension dtypes .. ?

(does that answer your question? I don't fully get what you are thinking about how this might be different than the nullable integer dtype we already have).

jbrockmendel commented 4 years ago

does that answer your question?

It does, thanks.

jorisvandenbossche commented 4 years ago

I opened a PR (https://github.com/pandas-dev/pandas/pull/34307) for this first part I proposed above: a masked array-based FloatingArray extension array (for now, in that PR, postponing the NA/NaN decision)

jbrockmendel commented 3 years ago

Going back over this thread, I mostly agree with Joris's last comment.

1) We should not break things for users that are currently happy with the float64 behavior. I expect most users don't ever think about the distinction between np.nan vs pd.NA, and we shouldn't make them learn about it. Users who opt in to Float64Dtype can get the fancier behavior.

2) Users shouldn't have to know anything about limits or log branches to understand "is 0/0 missing or nan" or "is log(-1) missing or nan". Let's propagate pd.NA when we see it and otherwise use np.nan.

3) If a user does series.iloc[0] = np.nan we should trust that they mean np.nan.

4) Changing .isna() for float64 would be a PITA for users (see point 1). Let's not do that. a) If a Float64Dtype Series contains np.nan and .isna says False for those entries, that affects .fillna and .dropna, where we generally expect to have no np.nans in the result. b) If we want to distinguish between pd.NA vs (pd.NA | np.nan) that calls for either a .ismissing or a keyword in .isna

5) The default behavior for __array__/to_numpy I'm on the fence about.
a) If there are no NAs present, then converting to float64 is a better user experience b) Otherwise, it depends on if the user cares about pd.NA vs np.nan. i) If they don't, then they should just be using float64 to begin with. ii) If they do, then they could reasonably want these retained by to_numpy c) Having to_numpy's result dtype depend on whether NAs are present is not desirable.

jreback commented 2 years ago

I agree here:

b) If we want to distinguish between pd.NA vs (pd.NA | np.nan) that calls for either a .ismissing or a keyword in .isna

we would need to introduce additional behavior to differentiate between the nan/NA.

Generally I think we would really need to see substantial demand for this feature as likey a lot of work to implement and cover all of the edge cases.

Dr-Irv commented 2 years ago

we would need to introduce additional behavior to differentiate between the nan/NA.

Generally I think we would really need to see substantial demand for this feature as likey a lot of work to implement and cover all of the edge cases.

I'm probably the one who has argued for differentiating between np.nan and pd.NA as "error in computation" versus "missing data". I've found this to be important in applications, where this differentiation helped find issues with underlying data or in a computation. For example:

>>> df = pd.DataFrame.from_records([[3.2, 1.6], [0.0, 0.0], [5.0, np.nan]], columns=["a", "b"]).convert_dtypes()
>>> df
     a     b
0  3.2   1.6
1  0.0   0.0
2  5.0  <NA>
>>> df.a/df.b
0     2.0
1     NaN
2    <NA>
dtype: Float64

Here it is useful to know that the first np.nan value was due to the values both being zero in the division, while the pd.NA value was due to one of the values missing. I've had projects where figuring this out (prior to the introduction of the floating extension dtypes) was a challenge, as there are two different errors in the source data. In one case, it is due to both values being zero (which was unexpected), and in the other case it was due to one of the values missing (which was OK).

So I think pandas should take a lead here in differentiating between "missing values" and "Not a number" and do what we can to make that differentiation clear.

jbrockmendel commented 2 years ago

I'm probably the one who has argued for differentiating between np.nan and pd.NA as "error in computation" versus "missing data". I've found this to be important in applications, where this differentiation helped find issues with underlying data or in a computation.

To get a sense of the room here, can I get a show of hands: does having np.nan/pd.NA be semantically distinct matter to you? Use reaction emojis to get a tally. Rocket emoji for "I don't care", Eyes emoji for "I don't care, as long as my existing workflows aren't affected."

jbrockmendel commented 2 years ago

Back when we were discussing implementing pd.NA in the first place there were a few comments to the effect of "when i teach pandas, having more than one NA value is a pain point", but I don't see them in #28095. Am I misremembering?

jorisvandenbossche commented 2 years ago

A general comment on the notion of "We should not break things for users that are currently happy with the float64 behavior" or "Changing .isna() for float64 would be a PITA for users": it is not because we would distinguish between NA and NaN in the pd.Float64Dtype and stop treating NaN as missing value in isna() (and thus dropna, fillna, etc) by default, that this would change actual end-user behaviour when people switch from np.float64 to pd.Float64Dtype.

As long as we convert NaN to NA on input / in constructors etc by default (what we currently do), users won't see a change in behaviour in practice for almost all usage. If you don't start with NaN in your data, the only ways that I am currently aware of how you can get NaNs through operations in pandas are 0/0 and np.log(negative_value). (there are probably other numpy operations that can also do that, it would be good to somehow list them).

Dr-Irv commented 2 years ago

If you don't start with NaN in your data, the only ways that I am currently aware of how you can get NaNs through operations in pandas are 0/0 and np.log(negative_value).

Another is np.sqrt(negative_value). The one that has hit me and my staff was the 0/0 case, where we had two series with mixes of numbers, zero values, and missing values. So when we did division of the series, it would have been very useful to know in the result which values from division were NA versus np.nan so we could differentiate doing the division with missing values versus the 0/0 case.

MichaelTiemannOSC commented 2 years ago

I came to this issue because I have a problem that crosses this issue. I have a pd.Series with a pint array dtype. Its values include a pd.NA value. When I attempt s.astype('pint[units]') the NA value cannot be converted to a float. But if the value is NaN, of course that is no problem. I can build a Pint Quantity with pd.NA no problem, it's just the astype insists on converting values to floats, and that throws a TypeError. I believe that if this issue is solved, its solution will guide the solution to my problem.

EDIT: note from @jorisvandenbossche: I hided this and the subsequent comments, as I think the question itself is resolved now, and further is slightly off-topic for the discussion here (the discussion is already very long, so hiding those comments makes it a bit easier to see/read the full discussion)

Dr-Irv commented 2 years ago

I came to this issue because I have a problem that crosses this issue. I have a pd.Series with a pint array dtype. Its values include a pd.NA value. When I attempt s.astype('pint[units]') the NA value cannot be converted to a float. But if the value is NaN, of course that is no problem. I can build a Pint Quantity with pd.NA no problem, it's just the astype insists on converting values to floats, and that throws a TypeError. I believe that if this issue is solved, its solution will guide the solution to my problem.

This may have to do with the implementation of the "pint" array dtype, which is not part of pandas, but something someone else implemented. Can you point to that?

MichaelTiemannOSC commented 2 years ago

I came to this issue because I have a problem that crosses this issue. I have a pd.Series with a pint array dtype. Its values include a pd.NA value. When I attempt s.astype('pint[units]') the NA value cannot be converted to a float. But if the value is NaN, of course that is no problem. I can build a Pint Quantity with pd.NA no problem, it's just the astype insists on converting values to floats, and that throws a TypeError. I believe that if this issue is solved, its solution will guide the solution to my problem.

This may have to do with the implementation of the "pint" array dtype, which is not part of pandas, but something someone else implemented. Can you point to that?

https://github.com/hgrecco/pint-pandas

Dr-Irv commented 2 years ago

This may have to do with the implementation of the "pint" array dtype, which is not part of pandas, but something someone else implemented. Can you point to that?

https://github.com/hgrecco/pint-pandas

I think this is an issue with the implementation of astype in that repo. For example, here is a pandas sample, where I create an "Int64" array, which has pd.NA in it, and then convert it to float, which converts the pd.NA to np.nan:

>>> import pandas as pd
>>> s=pd.Series([1,pd.NA,3], dtype="Int64")
>>> s
0       1
1    <NA>
2       3
dtype: Int64
>>> s.astype(float)
0    1.0
1    NaN
2    3.0
dtype: float64

The conversion in the pint case and in the pandas case is delegated to the parent array, so in your case it would be the pint implementation.

MichaelTiemannOSC commented 2 years ago

As I commented here on the pint-pandas issues queue, I agree that this is likely due to imperfect support of pd.NA:

https://github.com/hgrecco/pint-pandas/issues/88#issuecomment-1003934217

Alas, the issue that prompted this comment has been resolved, so now I need to check if this should become its own full-fledged issue for pint or pint-pandas.

Dr-Irv commented 2 years ago

With the current implementation of FloatArray, we do have this distinction in a simple use case:

>>> s = pd.Series(pd.array([1.0, 0.0, pd.NA, 2.0, np.nan], dtype="Float64"))
>>> s
0     1.0
1     0.0
2    <NA>
3     2.0
4    <NA>
dtype: Float64
>>> t = pd.Series(pd.array([0.0, 0.0, 3.0, 4.0, pd.NA], dtype="Float64"))
>>> t
0     0.0
1     0.0
2     3.0
3     4.0
4    <NA>
dtype: Float64
>>> q = s/t
>>> q
0     inf
1     NaN
2    <NA>
3     0.5
4    <NA>
dtype: Float64
>>> pd.isna(q)
0    False
1    False
2     True
3    False
4     True
dtype: bool
>>> q.sum()
nan
>>> f=q.astype(float)
>>> f
0    inf
1    NaN
2    NaN
3    0.5
4    NaN
dtype: float64
>>> f.sum()
inf
>>> f.iloc[1:].sum()
0.5

Doing the 0/0 calculation forces the np.nan value into the FloatArray. Not sure what would happen if then that is used in a groupby computation, although from the example above with q.sum(), I am guessing that the np.nan values get passed to cython, and it then does the sum and returns np.nan as the result.

jorisvandenbossche commented 2 years ago

Not sure what would happen if then that is used in a groupby computation, although from the example above with q.sum(), I am guessing that the np.nan values get passed to cython, and it then does the sum and returns np.nan as the result.

The groupby is an example of where this does not yet work out of the box. This is because it is not using numpy functions (which would propagate NaNs), but using a custom cython implementation which currently treats NaNs as missing values, and thus the NaN gets skipped by default:

>>> arr = pd.arrays.FloatingArray(np.array([np.nan, 1, 1, 1]), np.array([False, False, False, False]))
>>> arr
<FloatingArray>
[nan, 1.0, 1.0, 1.0]
Length: 4, dtype: Float64

>>> pd.Series(arr).groupby([0, 1, 0, 1]).sum()
Out[34]: 
0    1.0
1    2.0
dtype: Float64

The presence of np.inf we actually also don't really treat correctly for the plain numpy.float64 dtype in the groupby operation:

>>> pd.Series([np.inf, 1, 1, 1]).groupby([0, 1, 0, 1]).sum()
0    NaN
1    2.0
dtype: float64

# setting this option has no influence
>>> pd.options.mode.use_inf_as_na = True
>>> pd.Series([np.inf, 1, 1, 1]).groupby([0, 1, 0, 1]).sum()
0    NaN
1    2.0
dtype: float64

So this results in NaN in the result, while I suppose that the sum of [inf, 1] should be inf ?

So if we decide to go with distinguishing NA and NaNs as separate concepts, we will need to update the groupby algos for masked input to no longer special case NaN as missing.

jorisvandenbossche commented 2 years ago

One concrete example that @toobaz brought up above (https://github.com/pandas-dev/pandas/issues/32265#issuecomment-592959323) is the case of an empty mean giving NA but empty sum giving 0 (and thus potentially NaN). Showing this with a dummy example with a categorical with unobserved groups (but I think you can get the same with eg a resample operation where a period has no data):

# the example data
>>> df = pd.DataFrame({"key": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']),
...                    "values": pd.array([1, 2, 3], dtype="Float64")})
>>> df
  key  values
0   a     1.0
1   b     2.0
2   a     3.0
>>> df.dtypes
key       category
values     Float64
dtype: object

# Getting the sum gives 0 for the empty group, and thus dividing that with the size of the groups gives NaN
>>> df.groupby('key')['values'].sum()
key
a    4.0
b    2.0
c    0.0
Name: values, dtype: Float64

>>> df.groupby('key')['values'].sum() / df.groupby('key')['values'].size()
key
a    2.0
b    2.0
c    NaN
Name: values, dtype: Float64

# but calculating the same with mean() gives NA instead of NaN
>>> df.groupby('key')['values'].mean()
key
a     2.0
b     2.0
c    <NA>
Name: values, dtype: Float64

(in this specific case you can of course always use mean(), but there might be other situations where you want to do this manually for some reason)

Is it unfortunate that those two operations would give a different result (with the current behaviour of distinguishing NA and NaN) ?

jbrockmendel commented 2 years ago

So if we decide to go with distinguishing NA and NaNs as separate concepts, we will need to update the groupby algos for masked input to no longer special case NaN as missing.

Depending on if you're using a libgroupby function that has been adapted to use mask, if it sees that mask[i] is False it won't do any checking resembling isnan(values[i]).

User-wise, my only strong opinion remains that it should be essentially impossible to introduce a pd.NA except via I/O.

Implementation-wise, I agree with the arguments @jreback made on last week's call that this represents a massive API change and maintenance burden.

Dr-Irv commented 2 years ago

User-wise, my only strong opinion remains that it should be essentially impossible to introduce a pd.NA except via I/O.

We don't have that behavior now - see below.

>>> import pandas as pd
>>> s=pd.Series([1.2, 0.0, 5.6]).convert_dtypes()
>>> s.shift(1)
0    <NA>
1     1.2
2     0.0
dtype: Float64
>>> sf=pd.Series([1.2, 0.0, 5.6])
>>> sf
0    1.2
1    0.0
2    5.6
dtype: float64
>>> sf.shift(1)
0    NaN
1    1.2
2    0.0
dtype: float64
>>> s.shift(1)/[3.0, 2.0, 0.0]
0    <NA>
1     0.6
2     NaN
dtype: Float64
>>> sf.shift(1)/[3.0, 2.0, 0.0]
0    NaN
1    0.6
2    NaN

In this particular case, the differentiation of np.nan and pd.NA is important from the user perspective in the first division result. A member of my team had the second result, and it was due to a bug, but it was hard to find because the way that np.nan was introduced was different in the 2 cases. One np.nan occurred because of shift (or truly missing data), and the second occurred due to a bug (division by 0)

jbrockmendel commented 2 years ago

Thank you, .shift is a good example where my heuristic needs to be more fully fleshed out.

MarcoGorelli commented 1 year ago

I've finally read through the whole discussion

What I have most difficulty understanding is why .isna() would return False for np.nan.

if users ran df[~df.isna().any(axis=1)] and then found nans still there, it'd be very confusing. nans from mathematical operations aren't unusual. This'd be a huge breaking change;
cuDF allows both NA and nan, but their .isna() returns True for np.nan.

I think the cuDF behaviour is less surprising:

if a Series is constructed with np.nan, or from a source with missing values, then nan becomes NA
0/0 returns nan, and they keep it as such
isna() returns True for nan

Though my top preference would be to just have a single missing value (NA).

I'm also bothered by the mean() vs sum()/size() discrepancy, and I wouldn't be keen on intentionally introducing such an edge case.

TL;DR my position is:

preference for just having NA
if both NA and nan were allowed, then isna() would return True for nan

EDIT

I've had a change of heart here. OK to have both nan and NA, so long as there's an easy way to fill nans with NA

MarcoGorelli commented 1 year ago

Another point I'm struggling with is

The default behavior for array/to_numpy I'm on the fence about. a) If there are no NAs present, then converting to float64 is a better user experience

In this comment, it was pointed out that Int64 to numpy should be of dtype object, and that this shouldn't depend on whether or not there are missing values in the Series.

I'd find it a bit odd then to have

In [14]: pd.Series([1, 2, 3], dtype='Int64').to_numpy()
Out[14]: array([1, 2, 3], dtype=object)

In [15]: pd.Series([1, 2, 3], dtype='Float64').to_numpy()
Out[15]: array([1., 2., 3.])

Currently they both get turned to dtype=object, but in this issue it sounds like one of the suggestions in this thread is for Float64 to get converted to float64

MichaelTiemannOSC commented 1 year ago

xref https://github.com/lebigot/uncertainties/issues/169

For the above, it would be great if pd.isna() could return True for both np.nan and uncertainties with np.nan as a nominal value (or perhaps also error term).

a-reich commented 1 year ago

I just got bitten by this problem. I understand that the question of whether to differentiate the two values or treat the same/force a FloatingArray to only have pd.NA is a complex one, but it seems like a lack of movement in either direction is worse than either option. Right now, we can end up with NaNs without realizing it, and then bam, the tried and true method we always used to remove invalid data like isnull()/isna() will silently lead to downstream errors.

I basically second Marco's comment - if NaNs are allowed there should be a good way to handle them with the pandas API's for missing data. It's very confusing to a user that isna detects the same value, np.NaN, if the array happens to be float64 etc. but not Float64. In fact, the docs explicitly say it detects NaN, so isn't this a bug? Or if there isn't agreement on that, can parameters be added to these functions to enable handling both values? That's what pyarrow does.

a-reich commented 1 year ago

Would it be feasible to make progress on this in the 2.0 major release?

jbrockmendel commented 1 year ago

Would it be feasible to make progress on this in the 2.0 major release?

Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes.

a-reich commented 1 year ago

Would it be feasible to make progress on this in the 2.0 major release?

Honestly, not really. I'm planning to make a push on this for 3.0. In the medium-term my advice would be to not use nullable-float dtypes.

OK, cool. @jbrockmendel in the interim would people be open to a docs update to explain the current pitfalls (eg, the isna bug) to users?

jbrockmendel commented 1 year ago

Fine by me. @jorisvandenbossche might want to weigh in on the wording.

MarcoGorelli commented 1 year ago

TBH I've had a change of heart here - OK with having both, so long as there's a simple way of converting nan to NA

I don't think there currently is one, right? e.g.

In [24]: ser
Out[24]:
0     NaN
1     1.0
2     1.0
3    <NA>
dtype: Float64

In [25]: ser.fillna(None)

doesn't work, it raises ValueError: Must specify a fill 'value' or 'method'.

Dr-Irv commented 1 year ago

I don't think there currently is one, right? e.g.

Yes, there is:

>>> ser
0     NaN
1       1
2       1
3    <NA>
dtype: object
>>> ser.fillna(pd.NA)
0    <NA>
1       1
2       1
3    <NA>
dtype: object

MarcoGorelli commented 1 year ago

wait, that's not Float64 😄

In [37]: ser = pd.Series([0, 1, 2, None], dtype='Float64')

In [38]: ser /= ser

In [39]: ser
Out[39]:
0     NaN
1     1.0
2     1.0
3    <NA>
dtype: Float64

In [40]: ser.fillna(pd.NA)
Out[40]:
0     NaN
1     1.0
2     1.0
3    <NA>
dtype: Float64

Dr-Irv commented 1 year ago

Kind of hacky, but this works:

>>> ser = pd.Series([0, 1, 2, None], dtype='Float64')
>>> s0 = ser/ser
>>> s0
0     NaN
1     1.0
2     1.0
3    <NA>
dtype: Float64
>>> s0.mask(s0.astype(float).isna(), pd.NA)
0    <NA>
1     1.0
2     1.0
3    <NA>
dtype: Float64

MarcoGorelli commented 1 year ago

I think we may need to deprecate fillna in favour of fillnan (only fills nan) and fillnull (fills missing values), and then allow

In [40]: ser.fillnan(pd.NA)
Out[40]:
0     <NA>
1     1.0
2     1.0
3    <NA>

Likewise for

isna -> isnull & isnan
dropna -> dropnull & dropnan
notna -> notnull & notnan

EDIT

Or, alternatively, add add a keyword argument to tell whether the above should apply to 'missing' or 'nan' values

jbrockmendel commented 1 year ago

-0.9 on adding more api like that. Do we have third and fourth versions for None and NaT?

MarcoGorelli commented 1 year ago

I'd count both of those as 'missing' - I've just edited the comment though, noting that a keyword argument to tell what counts as 'na' would also be fine. Patrick's pointed out that Arrow has something similar https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null

Dr-Irv commented 1 year ago

I support the keyword idea like what Arrow has done

jbrockmendel commented 1 year ago

id be fine with a keyword, though i think we should have the default treat nan as na to keep behavior unchanged for users of non-masked dtypes.

jreback commented 1 year ago

I am -1 on have nan and NA

and in fact i think we should actually do a PDEP on pd.NA entirely - we didn't originally

MarcoGorelli commented 1 year ago

What do you think the expected output of

In [4]: ser = pd.Series([0, 1, 2, None], dtype='float64[pyarrow]')

In [5]: ser /= ser

In [6]: ser
Out[6]:
0     NaN
1     1.0
2     1.0
3    <NA>
dtype: double[pyarrow]

should be? That the first row 'NaN' be '<NA>'? And likewise for 'Float64' dtype

No objections, that would be simpler to be fair, but it would mean making sure that in operations which might introduce 'NaN' that '<NA>' be substituted

MarcoGorelli commented 1 year ago

How about keeping isna / fillna as they are, but just adding the nan_is_null keyword as suggested above?

Upon construction, when data is coming from Python/numpy, we convert any NaN into NA by default (long term, we could provide an option to turn this off if we decide to keep allowing NaNs)

If we're serious about allowing both, then I think that nan should not be converted to NA in the constructor. Otherwise, if you want to construct a Series like [1., 2., nan] with 'Float64' dtype, then you have to do something like

ser = pd.Series([1., 2., 0.], dtype='Float64')
other = pd.Series([1., 1., 0.], dtype='Float64')
ser = ser/other

If we don't want both nan and NA, then OK, but we need to make sure that every operation which introduces nan (like 0/0) has the nan replaced with NA, otherwise having isna behave differently for nullable vs non-nullable is going to be extremely confusing

Anyway, this will be a topic for discussion at the in-person sprint, hopefully we can make some progress there

MarcoGorelli commented 1 year ago

Just summarising my thoughts and position ahead of further discussions on this

For both Float64 and pyarrow-backed Float64, I think there's two main details to iron out:

construction: what should pd.Series([1., float('nan'), None], dtype='Float64') return?
mathematical operations: what should ser = pd.Series([0., 0.], dtype='Float64'); ser/ser return?
which should isna recognise?

I'd suggest one of the following two options:

Option 1: no NaN, only NA

In the above examples:

construction: it returns a Series with [1., NA, NA]
mathematical operations: it returns a Series with [NA, NA]

No need to add extra flags to isna. For non-nullable, it recognises NaN, for nullable it recognises NA.

Option 2: distinguish NaN and NA

NaN results from invalid mathematical operations, and is preserved under construction.

In the above examples:

construction: it returns a Series with [1., nan, NA]
mathematical operations: it returns a Series with [nan, nan]

isna would then need an extra nan_is_null flag which

if False would only recognise NA
if True, would recognise both NA and NaN (and this would probably be a good default)

Notes

Cases when data is truly missing (e.g. shift) would be NA in both cases

I'd have a slight preference for the latter, as it'd be more consistent with other tools. But I'd be fine with the former.

What I don't really like is that mixture present in the current behaviour:

NaN is converted to NA during construction
NaN can still be introduced with 0/0
there's no way for isna / fillna to treat nan as NA

mroeschke commented 1 year ago

If I am unable to participate in further discussion in this issue, I agree with Option 2: distinguish NaN and NA (especially with the myriad of xref issues here)

Dr-Irv commented 1 year ago

I strongly prefer option 2. I've had times where it was hard to tell if a computational error was due to missing data or something like a division by 0.

Probably worth specifying what happens in operations like Series.shift() or DataFrame.rolling().sum() that can insert na-like values in terms of whether they would insert np.nan versus pd.NA

MarcoGorelli commented 1 year ago

thanks - that would be pd.NA, nan would only be for "illegal"mathematical operations

phofl commented 1 year ago

What happens when astyping from numpy float with nan? Preserving as well?

MarcoGorelli commented 1 year ago

as in, pd.Series([1, float('nan')]).astype('Float64[pyarrow]')? If so, yes

jorisvandenbossche commented 1 year ago

If we're serious about allowing both, then I think that nan should not be converted to NA in the constructor. Otherwise, if you want to construct a Series like [1., 2., nan] with 'Float64' dtype, then you have to do something like
ser = pd.Series([1., 2., 0.], dtype='Float64')
other = pd.Series([1., 1., 0.], dtype='Float64')
ser = ser/other

It's mentioned somewhere hidden in the long thread above, but if we would keep this conversion, we could (and probably should) add a keyword here as well to control this, so you could do something like:

pd.Series([1., 2., np.nan], dtype="Float64", nan_as_null=False)

for an easier construction of a Series with a NaN.
(pyarrow essentially has this keyword as well in pa.array(..), under the name "from_pandas", where if you pass a pandas object, it treats the NaN as null)

[one of the three questions to answer] which should isna recognise?

I assume so, but just to be very explicit: whatever you say about isna then also applies to notna, dropna, fillna, .. right? So essentially the question is: should our missing-value-handling methods consider NaN as missing or not?

There might be a 4th question to answer: places where we have skipna logic, should those consider NaN as missing or not?

pandas-dev / pandas