scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

awkward.isnan #227

Open nsmith- opened 4 years ago

nsmith- commented 4 years ago

Currently, there is no convenient way to check whether a given array has masked entries unless it is a MaskedArray type, i.e. the mask methods are not universal. For float arrays, a workaround is numpy.isnan(array.fillna(numpy.nan)). I would propose array.isnan(axis=-1) or awkward.isnan(array, axis=-1) signatures, where the axis chooses the depth of the structure at which to evaluate the masking status, much like the array.flatten(axis=-1) function.
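As a rough illustration of why that workaround is limited, here is a plain-Python sketch in which ordinary lists stand in for Awkward arrays (nothing below is the library's API). Filling missing entries with NaN and then testing for NaN cannot tell a masked entry from a genuine NaN:

```python
import math

# Plain lists stand in for an array with one missing entry and one real NaN.
values = [1.0, None, float("nan"), 4.0]

# Analogue of numpy.isnan(array.fillna(numpy.nan)): fill, then test for NaN.
filled = [float("nan") if v is None else v for v in values]
nan_after_fill = [math.isnan(v) for v in filled]

# Direct check for missing entries only.
is_missing = [v is None for v in values]

print(nan_after_fill)  # [False, True, True, False]
print(is_missing)      # [False, True, False, False]
```

The two results differ at the third entry, which is a real NaN rather than a missing value.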

jpivarski commented 4 years ago

I think this is a good idea; the only thing I'm on the fence about is the name. I already started down the path of imitating Pandas with fillna, but whereas the difference between a masked integer and a floating-point NaN value is insignificant in Pandas, a None where a complex record might be is very different from a floating-point sqrt(-1). I'm beginning to regret ignoring the difference.

If a user is thinking of floating-point NaN, they'd probably think that the function would apply at the deepest level, even if that's a different level for different fields. If the user is thinking of None, the OptionType, they'd probably think that the function would apply at exactly one level, as you've described.

Maybe isna is a better choice than isnan, because it's not the same as the NumPy function. Or maybe isnone.

nsmith- commented 4 years ago

pandas has isna, notna, isnull, notnull and they all treat float('nan') and None the same. I suspect they also regret hijacking float('nan').

jpivarski commented 4 years ago

Yeah. They have so many synonyms because they're courting a userbase from R, which has a distinction between na and null. But also, I'm arguing that the distinction matters less if all of the values in question are some kind of number (or string, but more likely dictionary-encoded category). I'm thinking we might want to make a sharper distinction between a floating-point NaN and a None that can apply to any type.

smit2k14 commented 4 years ago

Hey... I am interested in solving this issue. Can I be assigned to it?

jpivarski commented 4 years ago

I certainly wouldn't be opposed. Be aware, though, that the primary development branch of Awkward is in the scikit-hep/awkward-1.0 repo. Contributions to the 0.x repo are welcome and affect existing users, but would have to be reimplemented in 1.0.

Therefore, what you do here could be seen as a "trial run" for the feature: getting it in front of users to see how useful it is, after which we might need to change some names in 1.0.

Also, note that the direction Pandas is headed is to more fully distinguish between "NA" (for any data type) and "NaN" (for floating point). This seems to be their biggest change when they released Pandas 1.0. If they've renamed the functions for that, we should follow their new naming scheme.

nsmith- commented 4 years ago

I would fully support this. An example of a similar global function that applies to MaskedArrays is pad() which may help with implementation. I would suggest a signature like isna(self, axis=None) where axis behaves as follows: None implies stop at first dimension that has a mask, while an integer axis is decremented each time the call passes through JaggedArray until it is zero, allowing e.g.:

import awkward as ak

# Proposed behavior only: isna is not implemented yet.
a = ak.fromiter([None, [1, 2, None, 3], []])
a.isna(axis=None).tolist() == [True, False, False]
a.isna(axis=0).tolist() == [True, False, False]
a.isna(axis=1).tolist() == [None, [False, False, True, False], []]

There may be reason to actually consider [] equivalent to None in the first case, but I'm not sure.
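To make the integer-axis rule concrete, here is a minimal pure-Python sketch on nested lists. The function name isna_sketch is hypothetical and this is not Awkward code. It assumes every non-missing entry above the requested depth is a list, and it leaves out the axis=None case:

```python
def isna_sketch(data, axis=0):
    """Hypothetical stand-in for the proposed isna, on plain nested lists."""
    if axis == 0:
        # At the requested depth: report which entries are missing.
        return [x is None for x in data]
    # Above the requested depth: pass missing lists through as None and
    # descend into the rest, decrementing axis at each list level.
    return [None if x is None else isna_sketch(x, axis - 1) for x in data]


a = [None, [1, 2, None, 3], []]
print(isna_sketch(a, axis=0))  # [True, False, False]
print(isna_sketch(a, axis=1))  # [None, [False, False, True, False], []]
```

In this sketch the empty list stays an empty list at axis=1 and counts as not missing at axis=0, so [] and None remain distinct.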

nsmith- commented 4 years ago

Probably axis=0 should be the default to be consistent with pandas ExtensionArray interface, as used in awkward1.

jpivarski commented 4 years ago

Unless a compelling argument can be given, I'd rather not consider [] equivalent to None.

There are two schools of thought on this. Google's formats (Protobuf, Flatbuffers, and Dremel, which became Parquet) don't make a distinction between [] and None, so values can only be required (exactly 1), optional (0 or 1), or repeated (0 or more). This seriously complicates some of the implementations (especially Parquet) and it blocks future distinctions one might want to make between "we didn't see any particles" and "the information about whether we saw any particles has been lost."

Other formats (Thrift, Avro, and Arrow) take the "programming language" approach that distinguishes [] from None. In my opinion, this is simpler because it's like a free algebra of generators: apply a "list" generator to the concrete number type and you get a list of numbers; apply the "option" generator and you get numbers that might be missing; apply "list" then "option" and you get lists of numbers that might be missing; apply the "option" then "list" and you get possibly-missing lists of numbers (which is not the same thing; they don't commute because the algebra is free). Equating [] with None introduces complexity into the algebra (like 3 + 7 == 4 + 6, which is more complicated than a free collection of strings).
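One way to picture the two compositions is with Python type hints. This is only an analogy, not how any of the formats above are implemented, and the alias names and the describe function are made up for illustration:

```python
from typing import List, Optional

# A list whose elements may individually be missing.
ListOfOption = List[Optional[float]]
# A list that may itself be missing.
OptionOfList = Optional[List[float]]


def describe(particles: OptionOfList) -> str:
    """Keep [] and None distinct instead of collapsing them."""
    if particles is None:
        return "information about particles was lost"
    if len(particles) == 0:
        return "no particles seen"
    return f"{len(particles)} particle(s) seen"


some_missing: ListOfOption = [1.1, None, 3.3]  # valid only for ListOfOption
print(describe([]))          # no particles seen
print(describe(None))        # information about particles was lost
print(describe([1.1, 2.2]))  # 2 particle(s) seen
```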

smit2k14 commented 4 years ago

So, is this functionality to be implemented for Awkward Arrays in general or just for Jagged Arrays? I was thinking that we could just flatten the array to find the 'NaN' values and then reconstruct the nested structure from the start and stop counts, with the content being True/False accordingly.
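The flatten-and-reconstruct idea could look roughly like the following pure-Python sketch. The function name is hypothetical, plain lists of lists stand in for a jagged array, and it assumes exactly one level of nesting with no missing sublists:

```python
def isna_by_flattening(jagged):
    """Flatten, test each element, then rebuild the nesting from counts."""
    counts = [len(sub) for sub in jagged]         # length of each sublist
    flat = [x for sub in jagged for x in sub]     # flattened content
    flat_mask = [x is None for x in flat]         # True where an entry is missing

    out, start = [], 0
    for n in counts:
        # Each sublist of the result is a slice of the flat mask.
        out.append(flat_mask[start:start + n])
        start += n
    return out


data = [[1.1, 2.2, None, 3.3], [], [4.4, None, 5.5]]
print(isna_by_flattening(data))
# [[False, False, True, False], [], [False, True, False]]
```

A sublist that is itself None would break the len() call, which is one reason the general case needs more than this.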

jpivarski commented 4 years ago

All operations should apply to all Awkward Array types. One of the most persistent issues has been when somebody's code is running fine on JaggedArrays, then for some reason they have MaskedArray of JaggedArray, or maybe ChunkedArray of JaggedArray, or something similar, possibly without realizing it, and the script no longer works.

In this environment, "preserving the abstraction" means only requiring the users to think about the logical meaning of their data, not the specific structures built up to represent it.

smit2k14 commented 4 years ago

So, should I implement this method in the base file of awkward array or implement it differently for each of the types of Awkward Arrays? Also, can I go ahead and use my above proposed solution or should I think of another one? (i.e. flattening and reconstruction)

jpivarski commented 4 years ago

It should probably be separately implemented for each of the array classes (i.e. as a method on each) because you'll probably have to do something different in some cases. In base.py, there's a superclass for all array types that have one _content (AwkwardArrayWithContent) which can help to reduce duplication, if there is any.
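A toy sketch of that layout, with a shared superclass supplying a default and individual classes overriding it, might look like this. All class names here are invented for illustration and do not match the real classes in base.py:

```python
class ToyArray:
    """Hypothetical root class: every array type must provide isna()."""
    def isna(self):
        raise NotImplementedError


class ToyWithContent(ToyArray):
    """Hypothetical stand-in for a superclass of arrays holding one content."""
    def __init__(self, content):
        self._content = content

    def isna(self):
        # Default for wrappers that add no missing values of their own.
        return self._content.isna()


class ToyPlain(ToyArray):
    """Leaf array with no missing values."""
    def __init__(self, values):
        self._values = list(values)

    def isna(self):
        return [False] * len(self._values)


class ToyMasked(ToyWithContent):
    """Wrapper that marks some entries of its content as missing."""
    def __init__(self, mask, content):
        super().__init__(content)
        self._mask = list(mask)

    def isna(self):
        return list(self._mask)


class ToyPassthrough(ToyWithContent):
    """Wrapper that inherits the default delegation unchanged."""


wrapped = ToyPassthrough(ToyMasked([False, True, False], ToyPlain([1, 2, 3])))
print(wrapped.isna())  # [False, True, False]
```

The caller gets the same answer whether or not the extra wrapper is present, which is the sense in which the abstraction is preserved.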

Meanwhile, isna has been implemented in Awkward 1.0 because Pandas required it. This definition doesn't go arbitrarily deep into the structure:

Therefore, this function won't be unwrapping JaggedArrays, applying itself to the contents, and then wrapping the result as JaggedArrays. For conformance with the Pandas function of the same name,

awkward.fromiter([[1.1, 2.2, None, 3.3], [], [4.4, None, 5.5]]).isna()

would return

[False, False, False]

because none of those three lists are missing. (That is, isna doesn't care whether there are missing elements inside the lists.)

A function that descends all the way, giving

[[False, False, True, False], [], [False, True, False]]

in the above example, should have a different name or be a non-default parameterized version of isna. (Otherwise, we won't be able to put Awkward arrays into Pandas columns, because this is what Pandas expects.)