pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: distinguish NA vs NaN in floating dtypes #32265

Open jorisvandenbossche opened 4 years ago

jorisvandenbossche commented 4 years ago

Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic about pd.NA vs np.nan was raised several times. And also in the recent pandas-dev mailing list discussion on pandas 2.0 it came up (both in the context of np.nan for float and pd.NaT for datetime-like).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep np.nan as the missing value indicator in float dtypes, but change its behaviour (e.g. in comparison operations) to match pd.NA
  • Introduce a new, nullable float dtype that uses pd.NA as the missing value indicator

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but presents it to the user as pd.NA, versus a masked approach like we use for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing", or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two describe nicely the two options we have on the question "do we want both pd.NA and np.nan in a float dtype and have them signify different things?" -> 1) Yes, we can have both, versus 2) No, towards the user we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>>  pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, or at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so eg comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for both NaN in the values and NA in the mask (which can also have performance implications).


Some other considerations:

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

jorisvandenbossche commented 4 years ago

How do other tools / languages deal with this?

Julia has both as separate concepts:

julia> arr = [1.0, missing, NaN]
3-element Array{Union{Missing, Float64},1}:
   1.0     
    missing
 NaN       

julia> ismissing.(arr)
3-element BitArray{1}:
 false
  true
 false

julia> isnan.(arr)
3-element Array{Union{Missing, Bool},1}:
 false       
      missing
  true       

R also has both, but will treat NaN as missing in is.na(..):

> v <- c(1.0, NA, NaN)
> v
[1]   1  NA NaN
> is.na(v)
[1] FALSE  TRUE  TRUE
> is.nan(v)
[1] FALSE FALSE  TRUE

Here, the "skipna" na.rm keyword also skips NaN (na.rm docs: "logical. Should missing values (including NaN) be removed?"):

> sum(v)
[1] NA
> sum(v, na.rm=TRUE)
[1] 1

Apache Arrow also has both (NaN can be a float value, while it tracks missing values in a mask). It doesn't yet have many computational tools, but e.g. the sum function skips missing values by default while propagating NaN (like numpy's sum does for float NaN).
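
For illustration, a minimal pyarrow sketch of that behaviour (assuming a recent pyarrow; the compute module was far more limited when this was written):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1.0, None, float("nan")])   # one null (missing) and one NaN (float value)
>>> pc.sum(arr)                                 # the null is skipped, the NaN propagates
<pyarrow.DoubleScalar: nan>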

I think SQL also has both, but didn't yet check in more detail how it handles NaN in missing-like operations.

toobaz commented 4 years ago

I still don't know the semantics of pd.NA enough to judge in detail, but I am skeptical on whether users do benefit from two distinct concepts. If as a user I divide 0 by 0, it's perfectly fine to me to consider the result as "missing". Even more so because when done in non-vectorized Python, it raises an error, not returning some "not a number" placeholder. I suspect the other languages (e.g. at least R) have semantics which are more driven by implementation than by user experience. And definitely I would have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

So ideally pd.NA and np.nan should be the same to users. If, as I understand, this is not possible given how pd.NA was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.

toobaz commented 4 years ago

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.

TomAugspurger commented 4 years ago

Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me.

Agreed.

do we want both pd.NA and np.nan in a float dtype and have them signify different things?

My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).

jreback commented 4 years ago

agree with Tom here

I think R even goes too far here, as this introduces enormous mental complexity; now I have 2 missing values? sure for the advanced user this might be ok but most don’t care and this adds to the development burden

that said if we could support both np.nan and pd.NA with limited complexity;

as propagating values (and both fillable) IOW they are basically the same except that we do preserve the fact that a np.nan can arise from a mathematical operation

then would be onboard.

Dr-Irv commented 4 years ago

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation) I did produced a NaN, which pandas treats as missing, and the true source of the problem was either back in the source data (e.g., that data should not have been missing) or a bug elsewhere in my code. So in these cases, where the NaN was introduced due to a bug in the source data or in my code, my later calculations were perfectly happy because to pandas, the NaN meant "missing". Finding this kind of bug is non-trivial.

I think we should support np.nan and pd.NA. To me, the complexity is in a few places:

  1. The transition for users, so they know that np.nan won't mean "missing" in the future, needs to be carefully thought out. Maybe we consider a global option to control this behavior?
  2. Going back and forth between pandas and numpy (and maybe other libraries). If we eventually have np.nan and pd.NA mean "not a number" and "missing", respectively, and numpy (or another library) treats np.nan as "missing", do we automate the conversions (both going from pandas to numpy/other and ingesting from numpy/other into pandas)?

We currently also have this inconsistent (IMHO) behavior which relates to (2) above:

>>> s=pd.Series([1,2,pd.NA], dtype="Int64")
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.to_numpy()
array([1, 2, <NA>], dtype=object)
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.astype(float).to_numpy()
array([ 1.,  2., nan])
toobaz commented 4 years ago

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Dr-Irv commented 4 years ago

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

When I said "such a calculation could indicate something wrong in the data that you need to identify and fix.", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that were not supposed to happen.

There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

toobaz commented 4 years ago

One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

toobaz commented 4 years ago

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

This is true. But... are there new usability insights compared to those we had back in 2017?

Dr-Irv commented 4 years ago

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

That's why I think having np.nan representing "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

shoyer commented 4 years ago

That's why I think having np.nan representing "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

+1 for consistency with other computational tools.

On the subject of automatic conversion into NumPy arrays, returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

dsaxton commented 4 years ago

I think using NA even for missing floats makes a lot of sense. In my opinion the same argument that NaN is semantically misleading for missing strings applies equally well to numeric data types.

It also seems trying to support both NaN and NA might be too complex and could be a significant source of confusion (I would think warnings / errors are the way to deal with bad computations rather than a special value indicating "you shouldn't have done this"). And if we're being pedantic NaN doesn't tell you whether you're dealing with 0 / 0 or log(-1), so it's technically still NA. :)

jbrockmendel commented 4 years ago

And if we're being pedantic NaN doesn't tell you whether you're dealing with 0 / 0 or log(-1), so it's technically still NA.

I propose that from now on we use a branch of log with a branch cut along the positive imaginary axis, avoiding this problem entirely.

jorisvandenbossche commented 4 years ago

Thanks all for the discussion!

[Pietro] And definitely I would have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

I think there is, or at least, we now have one: for the new pd.NA, we decided that it propagates in comparisons, while np.nan gives False in comparisons (following numpy behaviour, which in turn follows the floating point spec). Whether this is "natural" I don't know, but I think it is somewhat logical to do.
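
To make that difference concrete, a minimal session (current behaviour of the numpy scalar, the default float64 dtype and the nullable Int64 dtype):

>>> np.nan == 1.0                               # NaN: comparisons return False
False
>>> pd.NA == 1.0                                # NA: comparisons propagate
<NA>
>>> pd.Series([1.0, np.nan]) == 1.0
0     True
1    False
dtype: bool
>>> pd.Series([1, pd.NA], dtype="Int64") == 1
0    True
1    <NA>
dtype: boolean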

[Jeff] this introduces enormous mental complexity; now I have 2 missing values?

Note that it's not necessarily "2 missing values", but rather a "missing value" and a "not a number". Of course, current users are used to seeing NaN as a missing value. For those users, there will of course be initial confusion when NaN is no longer treated as a missing value. And this is certainly an aspect not to underestimate.

[Irv] Maybe we consider a global option to control this behavior?

There is already one for infinity (which is actually very similar to NaN, see more below): pd.options.mode.use_inf_as_na (default False). We could have a similar one for NaN (or a combined one).
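
For reference, a quick sketch of how that existing option behaves (note it has since been deprecated in more recent pandas versions):

>>> s = pd.Series([1.0, np.inf, np.nan])
>>> s.isna()                        # by default, inf is not treated as missing
0    False
1    False
2     True
dtype: bool
>>> pd.set_option("mode.use_inf_as_na", True)
>>> s.isna()                        # now inf is treated as missing as well
0    False
1     True
2     True
dtype: bool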

[Stephan] +1 for consistency with other computational tools.

Yes, I agree it would be nice to follow numpy for those cases that numpy handles (which is things that result in NaN, like 0/0). Having different behaviour for pd.NA is fine I think (like the different propagation in comparison ops), since numpy doesn't have that concept (so we can't really "deviate" from numpy).


From talking with @TomAugspurger and looking at examples, I somewhat convinced myself that making the distinction makes sense (not sure if it convinced @TomAugspurger as well, though, and there are still a lot of practical concerns).
Consider the following example:

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")  
>>> s   
0    NaN
1    inf
2    NaN
dtype: float64

>>> s.isna()
0     True
1    False
2     True
dtype: bool

The above is the current behaviour (where the original NA from the Int64 dtype also gives NaN in float, but with a potential new float dtype the third value would be <NA> instead of NaN). So here, 0 / 0 gives NaN, which is considered missing, while 1 / 0 gives inf, which is not considered missing. Is there a good reason for that difference? And did we in practice get many complaints, or have we seen much user confusion, about 1 / 0 resulting in inf and not being regarded as missing?

Based on that, I think the following (hypothetical) behaviour actually makes sense:

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")  
>>> s   
0     NaN
1     inf
2    <NA>
dtype: float64

>>> s.isna()
0    False
1    False
2     True
dtype: bool

As long as we ensure when creating a new "nullable float" series, that missing values (NA) are used and not NaN (unless the user explicitly asks for that), I think most users won't often run into having a NaN, or not that much more often than Inf (which already has the "non-missing" behaviour).

jorisvandenbossche commented 4 years ago

On the subject of automatic conversion into NumPy arrays, return an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

@shoyer I agree the object dtype is poor user experience. I think we opted (for now) for object dtype, since this is kind of the most "conservative" option: it at least "preserves the information", although in such a mostly useless way that it's up to the user to decide how to convert it properly. But indeed in most cases, users will then probably need to do .to_numpy(float, na_value=np.nan) (eg that's what scikit-learn will need to do). And if that is what most users will need, shouldn't it just be the default? I find this a hard one .. (as on the other hand, it's also not nice that the default array you get from np.asarray(..) has quite different behaviour for the NaNs compared to the original NAs).
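
For example, roughly what that conversion looks like today for the nullable integer dtype (and what a downstream library would need to call explicitly):

>>> s = pd.Series([1, 2, pd.NA], dtype="Int64")
>>> s.to_numpy()                                      # the current "conservative" default
array([1, 2, <NA>], dtype=object)
>>> s.to_numpy(dtype="float64", na_value=np.nan)      # explicit opt-in to float + NaN
array([ 1.,  2., nan])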

Another hard topic, in case we no longer see np.nan as missing in a new nullable float dtype, will be how to treat NaNs in numpy arrays. For example, what should pd.isna(np.array([np.nan], dtype=float)) do? What should pd.Series(np.array([np.nan]), dtype=<nullable float>) do? For the conversion from a numpy array to a Series, I think the default should be to convert NaNs to NA (since most people will have their missing values as NaN in numpy arrays, and so want them as NA in pandas). But if we do that, it would be strange that pd.isna would not return True for np.nan in a numpy array. But if it returns True in that case, that would then conflict with returning False for np.nan inside a nullable Series ...
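
For reference, the current behaviour that would be in question here:

>>> pd.isna(np.array([np.nan], dtype=float))    # NaN in a plain numpy array counts as missing today
array([ True])
>>> pd.Series(np.array([np.nan])).isna()        # and likewise after conversion to a Series
0    True
dtype: bool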

toobaz commented 4 years ago

I think there is, or at least, we now have one: for the new pd.NA, we decided that it propagates in comparisons, while np.nan gives False in comparisons (based on numpy behaviour, based on floating spec). Whether this is "natural" I don't know, but I think it is somewhat logical to do.

My opinion is that the new pd.NA behaves under this respect in a more "natural" way than the floating spec - at least in a context in which users work with several different dtypes. Hence I respect the decision to deviate. I just would limit the deviation as much as possible. To be honest (but that's maybe another discussion, and I didn't think much about the consequences) I would be tempted to completely eliminate np.nan from floats (replace with pd.NA), to solve this discrepancy (even at the cost of deviating from numpy).

Consider the following example:

Actually, your example reinforces my opinion on not making the distinction (where possible).

So here, 0 / 0 gives NaN, which is considered missing, while 1 / 0 gives inf, which is not considered missing. Is there a good reason for that difference?

In [2]: pd.Series([-1, 0, 1]) / pd.Series([0, 0, 0])                                                                                                                                                                                                                                                                                                                       
Out[2]: 
0   -inf
1    NaN
2    inf
dtype: float64

1 / 0 gives inf: this clearly suggests a limit (denominator going to 0 from the right); -1 / 0 gives -inf: same story; 0 / 0 gives NaN. Why? Clearly because depending on how you converge to 0 in the numerator, you could have 0, inf, or any finite number. So this NaN really talks about missing information, not about some "magic" or "unrepresentable" floating point number. Same holds for np.inf + np.inf vs. np.inf - np.inf. Compare to np.log([-1, 1]), which produces NaN not because any information is missing, but because the result is not representable as a real number.
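
The same contrast in plain numpy, as a quick check:

>>> np.inf + np.inf, np.inf - np.inf            # a defined limit vs. an indeterminate form
(inf, nan)
>>> np.log(np.array([-1.0, 1.0]))               # not representable as a real number (RuntimeWarning omitted)
array([nan,  0.])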

What I mean is: in the floating specs, NaN already denotes two different cases: of missing information, and of "unfeasible [within real numbers]" operation (together with any combinations of those - in particular when you propagate NaNs).

I know we all have in mind the distinction "I find missing values in my data" vs. "I produce missing values while handling the data". But this is a dangerous distinction to shape API, because the data which is an input for someone was an input for someone else. Are we really saying that if I do 0/0 it is "not a number", while if my data provider does exactly the same thing before passing me the data it is "missing data" that should behave differently?! What if at some step of my data pipeline... I am my data provider? Should we make np.NaN persist as pd.NA any time we save data to disk?!

jorisvandenbossche commented 4 years ago

[about the NaN as result from 0/0] So this NaN really talks about missing information, not about some "magic" or "unrepresentable" floating point number.

Sorry @toobaz, I don't understand your reasoning here (or just disagree, that's also possible). 0 and 0 are clearly both non-missing values in your data, so for me this "clearly" is not a case of talking missing information, but rather an unrepresentable floating point number. 0 and 0 can both be perfectly valid values in both series, it's only their combination and the specific operation that makes them invalid.

Also, you then say that np.log([-1]) gives a NaN not because of missing information. So would you then propose to have 0/0 become pd.NA but keep np.log(-1) as resulting in np.nan?

Are we really saying that if I do 0/0 it is "not a number", while if my data provider does exactly the same thing before passing me the data it is "missing data" that should behave differently?! What if at some step of my data pipeline... I am my data provider? Should we make np.NaN persist as pd.NA any time we save data to disk?!

That's indeed a problem if this is roundtripping through numpy, in which case we can't make the distinction (eg if you receive the data from someone else as a numpy array). For several file formats, though, we will be able to make the distinction. For example binary formats like parquet support both, and in principle also in csv we could support the distinction (although this would not be backwards compatible).

toobaz commented 4 years ago

0 and 0 are clearly both non-missing values in your data, so for me this "clearly" is not a case of talking missing information, but rather an unrepresentable floating point number.

Why do 1/0 and 0/0 - both of which, strictly speaking, have no answer (even outside reals) - lead to different results? The only explanation I can see is that you can imagine 1/0 as a limit tending to infinity, while in the case of 0/0 you really have no clue. That "no clue" for me means "missing information", not "error". If the problem was "arithmetic error", you'd have "1/0 = error" as well.

Now, I'm not saying I can read in the mind of whoever wrote the standard, or that I particularly agree with this choice, but this really reminds me (together with my example above about monthly averages, which is maybe more practical) that the difference between "missing" and "invalid" is very subtle, so much so that our intuition about what is missing or not seems already different from that which is present in the IEEE standard.

Also, you then say that np.log([-1]) gives a NaN not because of missing information. So would you then propose to have 0/0 become pd.NA but keep np.log(-1) as resulting in np.nan?

... I'm taking this as something we would consider if we distinguish the two concepts. And since it's everything but obvious (to me at least), I consider this as an argument for not distinguishing the two concepts.

That's indeed a problem if this is roundtripping through numpy, in which case we can't make the distinction (eg if you receive the data from someone else as a numpy array).

I was actually not making a point of "we are constrained by implementation", but really of "what should we conceptually do?". Do we want np.NaN as different from pd.NA because it helps us identify code errors we might want to solve? OK, then once we give up fixing the error in code (for instance because the 0/0 legitimately comes from an average on no observations) we should replace it with pd.NA. Creating np.NaN might be perfectly fine, but distributing it (on pd.NA-aware formats) would be akin to a programming mistake. We are really talking about the result of elementary operations which would (have to) become very context-dependent.

Anyway, if my arguments so far are not seen as convincing, I propose another approach: let us try to define which pandas operations currently producing np.NaN should start to produce pd.NA if we wanted to distinguish the two.

For instance: if data['employee'] is a categorical including empty categories, what should data.groupby('employee')['pay'].mean() return for such categories? pd.NA by default, I guess: there is no data...

What should data.groupby('worker')['pay'].sum() / data.groupby('worker').size() return in those cases? It's a 0/0, so np.NaN.

But these are really the same mathematical operation.

OK, so maybe we would solve the inconsistency if data.groupby('worker')['pay'].sum() already returned pd.NA for such categories. And in general - for consistency - for sums of empty lists. But we already have Series.sum(min_count=) which has the opposite default behavior, and for very good reasons: the sum of empty lists often has nothing to do with missing data. After a parallel processing operation, how much time did a CPU spend processing tasks if it happened to not process any? Simple: 0. There's no missing data whatsoever.
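
For reference, the current min_count behaviour being referred to:

>>> pd.Series([], dtype="float64").sum()              # an empty sum defaults to 0, not missing
0.0
>>> pd.Series([], dtype="float64").sum(min_count=1)   # opting in to "missing" instead
nan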

dsaxton commented 4 years ago

I think what @toobaz is saying is that 0 / 0 truly is indeterminate (if we think of it as the solution to 0x = 0, then it's essentially any number, which isn't too different from the meaning of NA). The log(-1) case is maybe less obvious, but I think you could still defend the choice to represent this as NA (assuming you're not raising an error or using complex numbers) by saying that you're returning "no answer" to these types of queries (and that way keep the meaning as missing data).

I guess I'm still unsure what would be the actual utility of having another value to represent "bad data" when you already have NA for null values? If you're expecting to see a number and don't (because you've taken 0 / 0 for example), how much more helpful is it to see NaN instead of NA?

To me this doesn't seem worth the potential confusion of always having to code around two null values (it's even not obvious if we should treat NaN as missing under this new interpretation; if the answer is no then do we now have to check for two things in places where otherwise we would just ask if something is NA?), and having to remember that they each behave differently. Using only NA would also seemingly make it easier to translate from numpy to pandas (np.nan is always pd.NA, rather than sometimes pd.NA, and other times np.nan depending on context)

(A bit of a tangent from this thread, but reading about infinity above made me wonder if this could also be a useful value to have in other non-float dtypes, for instance infinite Int64 or Datetime values?)

shoyer commented 4 years ago

I am coming around to the idea that distinguishing between NaN and NA may not be worth the trouble. I think it would be pretty reasonable to both:

  1. Always use NA instead of NaN for floating point values in pandas. This would change semantics for comparisons, but otherwise would be equivalent. It would not be possible to put NaN in a float array.
  2. Transparently convert NaN -> NA and NA -> NaN when going back and forth with NumPy arrays. This would go a long way toward compatibility with the existing ecosystem (e.g., xarray and scikit-learn). I really don't think anyone wants object dtype arrays, and NaN is close enough for libraries built on top of NumPy.
toobaz commented 4 years ago

I totally agree with @shoyer 's proposal.

It would be nice to leave a way for users to force keeping np.NaNs as such (in order to keep the old comparisons semantics, and maybe even to avoid the conversions performance hit?), but it might be far from trivial, and hence not worth the effort.

TomAugspurger commented 4 years ago

I’m probably fine with transparently converting NA to NaN in asarray for float dtypes. I’m less sure for integer, since that goes against our general rule of not being lossy.

toobaz commented 4 years ago

I agree. Without pd.NA, pandas users sooner or later were going to get accustomed to ints with missing values magically becoming floats, but that won't be true any more.

(Ideally, we would want a numpy masked array, but I guess asarray can't return that)

jorisvandenbossche commented 4 years ago

[Pietro, about deciding whether an operation should better return NaN or NA] And since it's everything but obvious (to me at least), I consider this as an argument for not distinguishing the two concepts.

I agree it is not obvious what is fundamentally "best". But, if we don't have good arguments either way, that could also be a reason to just follow the standard and what numpy does.

I propose another approach: let us try to define which pandas operations currently producing np.NaN should start to produce pd.NA if we wanted to distinguish the two.

In theory, I think there can be a clear cut: we could produce NaN whenever an operation with numpy produces a NaN, and we produce NAs whenever it is a pandas concept such as alignment or skipna=False that produces NAs. Now, in practice, there might be corner cases though. Like the unobserved categories you mentioned (which can be seen as missing (-> NA) or as length 0 (-> mean would give NaN)). mean([]) might be such a corner case in general. Those corner cases are certainly good reasons to not make the distinction.

I think what @toobaz is saying is that 0 / 0 truly is indeterminate (if we think of it as the solution to 0x = 0, then it's essentially any number, which isn't too different from the meaning of NA).

OK, that I understand!

I guess I'm still unsure what would be the actual utility of having another value to represent "bad data" when you already have NA for null values?

Apart from the (possible) utility for users to be able to represent both (which is of course a trade-off with the added complexity for users of having both), there are also other clear advantages of having both NaN and NA, I think:

  • It is (mostly / more) consistent with R, Julia, SQL, Arrow, ... (basically any other data system I am somewhat familiar with myself)
  • It is easier to implement and possibly more performant / more able to share code with the masked integers. (e.g. we don't need to check if NaNs are produced in certain operations to ensure we convert them to NA)

This last item of course gets us into the implementation question (which I actually wanted to avoid initially). But assuming we go with:

Always use NA instead of NaN for floating point values in pandas. This would change semantics for comparisons, but otherwise would be equivalent. It would not be possible to put NaN in a float array.

would people still use NaN as a sentinel for NA, or use a mask and ensure all NaN values in the values are also marked in the mask? The advantage of using NaN as a sentinel is that we don't need to check for NaN being produced or inserted (as the NaN will be interpreted as NA anyway), and that conversion to numpy is easier. The advantage of a mask is that we can more easily share code with the other masked extension arrays (although with a mask property that is dynamically calculated, we can probably still share a lot), and it keeps open the potential of zero-copy conversion with Arrow.

jorisvandenbossche commented 4 years ago

I agree that for the conversion to numpy (or at least __array__), we can probably use floats with NaNs.

TomAugspurger commented 4 years ago

FWIW, @kkraus14 noted that cuDF supports both NaN and NA in their float columns, and their users are generally happy with the ability to have both.

jorisvandenbossche commented 4 years ago

Ah, yes, I forgot to chat with him about that in person. @kkraus14 if you are able to write down some of your experience related to this to expand on the above, that would be very welcome!

kkraus14 commented 4 years ago

Sure:

We've used the Apache Arrow memory layout since the beginning in cuDF, and as @TomAugspurger noted, our users have generally been fond of being able to have both NULL and NaN in floating point columns. From our perspective, the reasoning behind it was that NULL is a missing value, whereas NaN is defined invalid.

From this, NULL values have consistent behavior across all of the types with respect to things like binary operations / comparisons, where NULLs always propagate, as opposed to NaNs, which return True / False for many binary comparisons. For operations that implicitly compare NULL and NaN values, like sorting, groupby, joins, etc., we generally give knobs to allow controlling the behavior.

This also aligns with typical binary IO formats like Parquet / ORC where they distinguish NaN vs NULL as well.

Dr-Irv commented 4 years ago

Prior to using pandas, I spent a fair part of my career writing numerical software, but the concept of "missing data" was not something we worried about. And if we saw NaN in a result, it meant something numerically. So my comments are coming from that perspective.

IMHO, we should keep np.NaN and pd.NA separate, to correspond to "not a number" and "missing values", and have the to_numpy and from_numpy methods convert between the pandas representation of pd.NA to/from the numpy representation of np.NaN. From my perspective, at some point way back in time, a choice was made to say that the numpy np.NaN would mean "missing values" for pandas. But even numpy has a facility to separate the two. See the numpy.ma module: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html, where they write "Masked arrays are arrays that may have missing or invalid entries. The numpy.ma module provides a nearly work-alike replacement for numpy that supports data arrays with masks."

So maybe we should consider leveraging the numpy.ma module to handle parts of the implementation and the semantics of "missing" vs. "not-a-number".
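
A small sketch of how numpy.ma already separates the two concepts (masked = missing, unmasked NaN = invalid):

>>> arr = np.ma.array([1.0, 2.0, np.nan], mask=[False, True, False])
>>> arr.sum()                                   # the masked 2.0 is skipped, the unmasked NaN propagates
nan
>>> np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False]).sum()
4.0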

jreback commented 4 years ago

-1 on using numpy.ma at all. this has always been a 1/3 baked implementation and should not inform pandas at all

Dr-Irv commented 4 years ago

-1 on using numpy.ma at all. this has always been a 1/3 baked implementation and should not inform pandas at all

Fair enough from the implementation standpoint, but the fact that numpy has the ability to treat missing values separately shows that even they thought about distinguishing "not a number" vs. "missing values", which is similar to what was said about cuDF.

jreback commented 4 years ago

-1 on using numpy.ma at all. this has always been a 1/3 baked implementation and should not inform pandas at all

Fair enough from the implementation standpoint, but the fact that numpy has the ability to treat missing values separately shows that even they thought about distinguishing "not a number" vs. "missing values", which is similar to what was said about cuDF.

thought about generally, but the difficulty is a practical one

meaning do we now have skipna=missing|NaN

this will become confusing / complex for the average user very rapidly

Dr-Irv commented 4 years ago

meaning do we now have skipna=missing|NaN

this will become confusing / complex for the average user very rapidly

To some extent, the choices that pandas made way back when to a) treat np.NaN as missing and b) make skipna=True the default have created this potential confusion.

IMHO, there are two issues at play: 1) What is the right design that makes sense independent of the choices made in the past? (i.e., what would you do if you were really creating pandas from scratch?) 2) How do we transition people to a new behavior (whatever it may be)?

I think this discussion may sometimes be mixing these two issues.

If we were really starting over, I don't think we'd choose to interpret numpy np.NaN as "missing", and we'd separate the meaning of "missing" versus "not a number" (as in R, Julia, SQL, Arrow, cuDF, etc.). If that is the case, and we agree that is the design going forward, then we have to deal with the transition, which relates to the confusion issue you raise above.

I'm currently a little bothered by the following inconsistency:

>>> s=np.array([1.0, 2.0, np.nan])
>>> s.sum()
nan
>>> pd.Series(s).sum()
3.0

This inconsistency was created by (a) and (b) above. Now, we have an opportunity to resolve it, while being mindful of the transition issue from current pandas 1.0 behavior to some new behavior in pandas 2.0.

jbrockmendel commented 4 years ago

[Joris] There is already one for infinity (which is actually very similar to NaN, see more below): pd.options.mode.use_inf_as_na (default False). We could have a similar one for NaN (or a combined one).

I am -0.75 for adding another global flag for this. We're not very consistent about respecting this, plus have poor tests/perf for the inf_as_na mode as it is.

toobaz commented 4 years ago

the fact that numpy has the ability to treat missing values separately shows that even they thought about distinguishing "not a number" vs. "missing values"

As much as I appreciate examples from other languages/libraries I reject this conclusion: numpy most likely introduced masked arrays for the exact same reason why we introduced pd.NA: that NaN only exists for floats. I'm pretty sure we would have never bothered about pd.NA if each dtype we use had its native "non-value" value, whatever it was called.

toobaz commented 4 years ago

This inconsistency was created by (a) and (b) above

No, only by (b)...

You cite other languages, which is interesting: but technically speaking, pandas will also have the distinction anyway (as long as we have pd.NA and we read data with NaNs).

The point is different: did those language look for the distinction, or did they adapt to the fact that NaN was only present for floats?

toobaz commented 4 years ago

Sorry for the multiple comments.

I can understand the desire to adapt to other libraries and "speak/understand their language" in terms of support for pd.NA and NaN.

What I'm really arguing against is that pandas semantics for pd.NA and NaN should be "encouraged" to diverge for any assumingly philosophical distinction between "invalid" and "missing".

If instead we insist on adopting, and teaching to our users, a conceptual distinction between the two, I humbly ask to reconsider - and "solve" - my example above.

... together with the fact that pd.Series([0], dtype=float) / 0 should (I guess) return np.NaN (it's the prototypical case), while pd.Series([0], dtype=int) / 0 should (I guess) return pd.NA, as the only other option is raising an error where currently no error is raised.

Dr-Irv commented 4 years ago

This inconsistency was created by (a) and (b) above

No, only by (b)...

I say (a) as well, because if when pandas was created, a decision was made (as in the other languages) to not use np.NaN to mean "missing", we wouldn't be having this discussion. But that's water under the bridge at this point.

You cite other languages, which is interesting: but technically speaking, pandas will also have the distinction anyway (as long as we have pd.NA and we read data with NaNs).

Read data? How? Our readers translate "missing" to np.NaN. If we decide to use pd.NA universally to mean "missing", we'd presumably change the readers. (Note - this is why I thought it was important to create convert_dtypes())

The point is different: did those language look for the distinction, or did they adapt to the fact that NaN was only present for floats?

I'm guessing they looked for them (but I'm not in the position to know for sure) and now it's become our turn to look for the distinction. But that's just my opinion.

Dr-Irv commented 4 years ago

If instead we insist on adopting, and teaching to our users, a conceptual distinction between the two, I humbly ask to reconsider - and "solve" - my example above.

... together with the fact that pd.Series([0], dtype=float) / 0 should (I guess) return np.NaN (it's the prototypical case), while pd.Series([0], dtype=int) / 0 should (I guess) return pd.NA, as the only other option is raising an error where currently no error is raised.

We're getting into the issues of the meaning of NaN in the IEEE standard, and there are quiet NaNs and signaling NaNs. Which ones you use and how you handle them could be argued many different ways.

Consider these 3 different behaviors in python and numpy:

>>> 0/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
>>> import numpy as np
>>> np.array([0])/np.array([0])
__main__:1: RuntimeWarning: invalid value encountered in true_divide
array([nan])
>>> np.array([0.0])/np.array([0.0])
array([nan])

The warning from numpy is only issued once. So if you do the same computation again, you don't get a warning.

In any case, there are reasons one could choose any of the above behaviors as default in pandas. But I don't think that pd.Series([0], dtype=int) / 0 should return pd.NA. I think it would return np.NaN (using the numpy behavior, but without a warning)

toobaz commented 4 years ago

I don't think that pd.Series([0], dtype=int) / 0 should return pd.NA. I think it would return np.NaN

... which means that (i) we will still upcast ints to floats in cases like this, despite now having nullable ints, or alternatively that (ii) we will have to add np.NaN to nullable ints?!

jorisvandenbossche commented 4 years ago

which means that (i) we will still upcast ints to floats in cases like this, despite now having nullable ints

Division of ints will always give float dtype (for type stability, as the result can be floats), that is independent of nullability of the dtypes.

toobaz commented 4 years ago

Division of ints will always give float dtype

Oops, sorry, bad example.

jorisvandenbossche commented 4 years ago

@kkraus14 thanks for the input!

[Pietro] I'm pretty sure we would have never bothered about pd.NA if each dtype we use had its native "non-value" value, whatever it was called.

FWIW, R actually uses sentinels (so it has a "null value" for each "data type", but the type system in R is quite different), yet it still has both NaN and NA for the numeric data type.

[Pietro] but technically speaking, pandas will also have the distinction anyway (as long as we have pd.NA and we read data with NaNs).

@toobaz If going with the option of not distinguishing NaN and NA, I would assume we actually never have NaN, so also not from reading data. I would say: if we go for the "only NA" option, any data that has both NaN and NAs in it will get converted to all NAs? Do you have something else in mind? (e.g. still allow NaNs coming from input data, but not "create" them in operations like 0/0?)

From https://github.com/pandas-dev/pandas/issues/32265#issuecomment-593505078:

the fact that pd.Series([0], dtype=float) / 0 should (I guess) return np.NaN (it's the prototipical case)

@toobaz weren't you arguing above that this should return NA, and not NaN (at least that is how I understood it, since you were arguing for not distinguishing the result of 0/0 with "missing").

dsaxton commented 4 years ago
  • It is (mostly / more) consistent with R, Julia, SQL, Arrow, ... (basically any other data system I am somewhat familiar with myself)

I'm not sure if consistency with other tools is much of a concern when many of these don't seem consistent with each other; e.g., SQLite treats 0 / 0 and 1 / 0 as NULL and not NaN, Postgres doesn't allow division by zero or logs of negative numbers (I think it does have NaN values, but they seem to compare equal), and while R does have NaN it doesn't behave like NaN (NaN == NaN is NA).

So I feel that pandas should do whatever makes the most sense for pandas users / developers and not worry too much how these other tools behave.

  • It is easier to implement and possibly more performant / more able to share code with the masked integers. (e.g. we don't need to check if NaNs are produced in certain operations to ensure we convert them to NA)

That does seem like a good argument for allowing NaN.

jorisvandenbossche commented 4 years ago

I'm not sure if consistency with other tools is much of concern when many of these don't seem consistent with each other;

The SQL (at least PostgreSQL, SQLite) example is indeed not a very good example. For Postgres, although they support storing NaN, they don't support the operations that can create them, and they also deviate from the standard on how to treat them (eg in ordering, see https://www.postgresql.org/docs/9.1/datatype-numeric.html). Now, there are also other (more modern?) SQL databases such as Clickhouse that seem to support NaN (at least have both, and create NaN on 0/0, but didn't look in detail how they are handled in for example reductions).

It's certainly true that we shouldn't look too much at others (certainly if for almost every option you might find an example), and decide what is best for pandas. But I think it is still valuable information for deciding that.


One strong reason for me to have both NaN and NA is a practical reason: the computational "engine" backing pandas and other dataframes will for the foreseeable future be mostly numpy and maybe increasingly arrow, I think (pandas itself, and thus also dask, is now based on numpy; fletcher is experimenting with pandas based on arrow; cudf and vaex are based on arrow; modin has both a pandas and an arrow backend). And both those computational engines have NaNs and produce NaNs from computations. So (I think) it will always be easiest to just go with whatever those computational engines return in unary/binary operations.

Note that in such a case we can still see NaN as missing when it comes to skipping (which is eg what R does) if that is preferred, which basically would delay the cost for checking NaNs to those operations (instead of already needing to do the check on binary operations like division).


I will try to make a summary of the discussion and arguments up to now.

jrhemstad commented 4 years ago

Quick question from someone who is definitely not a Pandas user.

Assuming that Pandas keeps np.nan and doesn't add pd.NA for floating point types, what would the behavior for na_position in sort_values be?

As a C++ dev, supporting the na_position parameter for NaN's today requires deviating from the IEEE 754 standard as I have to specialize the floating point comparator to change the behavior of NaN < x and NaN > x.

As a more general point, when deciding on the np.nan vs pd.NA, I'd strongly support any solution that doesn't lead to non-conformance with IEEE 754 (like the sort_values example).

jorisvandenbossche commented 4 years ago

@jrhemstad a very late answer (and I see the discussion at cudf came to somewhat of a conclusion, where you will have "special" behaviour for NaN when it comes to eg sorting), but to answer your question anyway:

Assuming that Pandas keeps np.nan and doesn't add pd.NA for floating point types, what would the behavior for na_position in sort_values be?

Not adding pd.NA is not really considered as an option (at least for the new nullable float dtype which is being discussed in this issue, the existing float64 dtype will (at least for now) keep using np.nan).
But assuming you meant "assuming Pandas keeps np.nan" in addition to pd.NA: I think we will just keep the current behaviour (which is sorting NaNs last by default, with the option to sort them first). At least, changing this behaviour has not been brought up in the discussion so far. And for nulls (NA), we will certainly keep that behaviour.

Note that this NaN-sorting behaviour is not pandas specific. Numpy has the same behaviour (so pandas can rely on numpy for this): following IEEE 754 for elementwise comparisons, but deviating from it to have deterministic sorting (NaNs last, see docs). Julia, for example, follows the same behaviour.
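
For reference, the existing sorting behaviour being described (current numpy and pandas defaults):

>>> np.sort(np.array([3.0, np.nan, 1.0]))       # numpy sorts NaNs to the end
array([ 1.,  3., nan])
>>> pd.Series([3.0, np.nan, 1.0]).sort_values(na_position="first")
1    NaN
2    1.0
0    3.0
dtype: float64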

kkraus14 commented 4 years ago

@jorisvandenbossche any thoughts about if na_position would control pd.NA or NaN objects or both? How would one be ordered relative to the other?

jorisvandenbossche commented 4 years ago

I would propose that it controls both (with "last", you would get the order of "values - NaNs - NAs"). This is similar to how traditional databases like PostgreSQL sort, but also ClickHouse (doc) or for example Julia.