scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

Are nested masked arrays a valid type? #218

Open nsmith- opened 4 years ago

nsmith- commented 4 years ago

I managed to end up with something like

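import numpy as np
import awkward as ak  # awkward 0.x

# Sublists of length 1, 0 and 3 over a masked content array;
# True in the mask marks a missing value.
a = ak.JaggedArray.fromcounts(
    np.array([1, 0, 3]),
    ak.MaskedArray(
        np.array([True, False, True, False]),
        np.arange(4),
    )
)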

which gives <JaggedArray [[None] [] [1 None 3]] at 0x000111048690>, and I then proceeded to select some index inside the array with

af = a[a.argmax()].pad(1, clip=True).flatten()

leaving me with <MaskedArray [None None 3] at 0x00013a865750>. All good so far, but the type is very strange: ArrayType(3, OptionType(OptionType(dtype('int64')))). I don't understand what a nested OptionType means. I can at least collapse it: af[~af.boolmask()].content.content returns array([3]).

jpivarski commented 4 years ago

It's valid, and it should be equivalent to the union of the two masks (with maskedwhen=True). I had thought there was logic saying that OptionType(OptionType(X)) is equal to OptionType(X); I put in a few of these algebraic rules, but that's a rabbit hole.

Yeah, it's true:

>>> import awkward, numpy
>>> array = awkward.MaskedArray([False, True, False, True, False, True],
...             awkward.MaskedArray([False, False, False, True, True, True],
...             [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]))

>>> # checkerboard unions with half-and-half
>>> array
<MaskedArray [1.1 None 3.3 None None None] at 0x78d638ac5a90>

>>> # two levels deep
>>> array.type
ArrayType(6, OptionType(OptionType(dtype('float64'))))

>>> # is equivalent to one level deep
>>> array.type == awkward.type.ArrayType(6,
...                   awkward.type.OptionType(numpy.dtype("float64")))
True
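
As a rough illustration of why the two levels collapse to one, the same example can be worked through in plain NumPy without awkward. This is only a sketch: the variable names are made up for the illustration, and it assumes both masks have the same length and follow the maskedwhen=True convention (True means missing).

```python
import numpy as np

# Masks follow the maskedwhen=True convention: True marks a missing value.
outer_mask = np.array([False, True, False, True, False, True])
inner_mask = np.array([False, False, False, True, True, True])
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

# A value is missing if either layer masks it, so the two layers
# collapse to the union (logical OR) of their masks.
combined_mask = outer_mask | inner_mask

# Render the single-layer view as a Python list, with None for missing entries.
collapsed = [
    None if missing else value
    for missing, value in zip(combined_mask.tolist(), content.tolist())
]
print(collapsed)  # [1.1, None, 3.3, None, None, None]
```

The printed list matches the repr of the nested MaskedArray above, which is what makes a single OptionType an adequate description of it.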

I have to decide how much of that should survive into the new era. One good thing about reimplementation is that stuff that seemed like a good idea at the time but never actually got used goes away. Users won't be encouraged to make their own array structures anymore, so I guess I don't need to police it.

I guess you've found that pad needs to be smarter: if it's already looking at a MaskedArray, it should add to its mask, rather than introduce another layer.
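
To make the "add to its mask" idea concrete, here is a minimal NumPy sketch for a single list. It is not awkward's implementation: pad_masked is a hypothetical helper invented for this illustration, it assumes the maskedwhen=True convention, and it fills padded slots with zeros as an arbitrary placeholder.

```python
import numpy as np

def pad_masked(mask, content, length):
    """Hypothetical helper: pad or clip one masked list to `length` entries.

    `mask` is True where a value is missing (maskedwhen=True convention).
    Padding extends the existing mask instead of wrapping the data in a
    second mask, so the result still has a single option layer.
    """
    # Clip first, so lists longer than `length` are truncated.
    mask = np.asarray(mask, dtype=bool)[:length]
    content = np.asarray(content)[:length]

    missing = length - len(content)
    if missing > 0:
        # New slots are marked missing; their content is a placeholder.
        mask = np.concatenate([mask, np.ones(missing, dtype=bool)])
        content = np.concatenate(
            [content, np.zeros(missing, dtype=content.dtype)]
        )
    return mask, content

# An empty list padded to length 1 becomes one missing entry.
empty_mask, empty_content = pad_masked([], [], 1)

# A partly masked list padded to length 3 gains one more masked slot.
padded_mask, padded_content = pad_masked([True, False], [0, 7], 3)
```

Because the padded slots reuse the mask that is already there, the type stays at one OptionType level instead of two.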

Also seeking opinions: I want to change the name from "MaskedArray" to something else because of how often we use the word "mask" to refer to slicing with a boolean array—a concept that's similar enough but different from what MaskedArray does to cause confusion. "Masked" is what NumPy calls it, though maybe it's a bad thing to use a similar word for not-really-the-same classes (numpy.ma.MaskedArray isn't interchangeable with awkward.MaskedArray: the latter can contain jagged data, for instance). Besides, "masked" describes the how, not the what.

It seems to me that we have two other words for this, "nullable" and "optional." "Nullable" is an SQL term and "optional" or "option" is popular among modern programming languages. Haskell uses "Maybe." I'm leaning toward

(I'm not ignoring your other issue, #217; it just looks more difficult at the moment.)