scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

Achieve masking #234

Closed mverzett closed 4 years ago

mverzett commented 4 years ago

Hi All,

I have a function that is only defined for a subset of events (it uses the first muon in the event, if there is one), and I would like to attach its output to the full event table, because I am stubborn :) and want to keep the data bundled together. I am looking for something that allows the following:

event_mask = np.array([True, True, False, False, True, False]) # size 6, 3 True
event_observable = awkward.JaggedArray.fromiter([[1,2,3], [4,5,6], [7,8,9]]) # jagged, size 3, only for the true events
awkward.MaskedSomething(event_mask, event_observable) # --> behaves like a shape (6, -) with empty or NaN values in the false part, I don't particularly care

Is there a class that supports this functionality, and if so, how do I use it? I checked MaskedArray and IndexedMaskedArray, but both seem to require the values to be as long as the mask, which is not possible in my case.

jpivarski commented 4 years ago

I made it a to-do item yesterday: https://github.com/scikit-hep/awkward-1.0/issues/127

But you can use the IndexedMaskedArray constructor to make option-type data (data with Nones) without changing the length of the content.

What you have to do is make a "mask" that is increasing, non-negative integers for non-masked values and -1 for masked values:

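mask = np.full(length_of_table, -1)  # start with every entry masked (-1 means None)
mask[selection] = np.arange(np.count_nonzero(selection))  # selected entries point at positions 0, 1, 2, ... in the content
to_put_in_table = awkward.IndexedMaskedArray(mask, events[selection])  # the content keeps its shorter length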

The word "mask" is inappropriate here because this is a reshaping index with negative values interpreted as None. It will be renamed "IndexedOptionArray".
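To make the index convention above concrete, here is a minimal NumPy-only sketch using the mask and values from the original question. It does not call Awkward itself; `materialize` is a hypothetical helper, written only to show how an index with -1 entries could be read as option-type data.

```python
import numpy as np

# Boolean selection over all 6 events; only 3 of them have a value.
event_mask = np.array([True, True, False, False, True, False])
observable = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one entry per selected event

# Build the index: -1 for masked events, 0, 1, 2, ... for selected ones.
index = np.full(len(event_mask), -1, dtype=np.int64)
index[event_mask] = np.arange(np.count_nonzero(event_mask))


def materialize(index, content):
    """Hypothetical helper: expand (index, content) into a list with None for -1."""
    return [None if i < 0 else content[i] for i in index]


expanded = materialize(index, observable)
print(index.tolist())
print(expanded)
```

The content stays at length 3 while the expanded view has length 6, which is the property asked for in the question.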

mverzett commented 4 years ago

Indeed it works, I don't know why it did not when I tried, but thanks!

mverzett commented 4 years ago

Sorry to bother you again. Masking this way works, but when I try to compute a deltaR between a jagged LorentzVectorArray and a flat IndexedMaskedArray, I get an exception. The shapes match, so I think there is just some assumption in the broadcasting. It is quite late now; I will try to write a simple test case in the next few days.


    return self.awkward.numpy.sqrt(self.delta_r2(other))
  File "/home/mverzett/miniconda3/lib/python3.7/site-packages/uproot_methods/classes/TLorentzVector.py", line 86, in delta_r2
    return (self.eta - other.eta)**2 + self.delta_phi(other)**2
  File "/home/mverzett/miniconda3/lib/python3.7/site-packages/numpy/lib/mixins.py", line 25, in func
    return ufunc(self, other)
  File "/home/mverzett/miniconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 1027, in __array_ufunc__
    content = recurse(data)
  File "/home/mverzett/miniconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 1024, in recurse
    content[good] = x.reshape(-1)[parents[good]]
  File "/home/mverzett/miniconda3/lib/python3.7/site-packages/awkward/array/base.py", line 256, in __getattr__
    raise AttributeError("no column named {0}".format(repr(where)))
AttributeError: no column named 'reshape'

jpivarski commented 4 years ago

This might be related to https://github.com/scikit-hep/uproot/issues/458, though you're not using ChunkedArrays.

Somewhere in there, there's an assumption that an array node's content is a numpy.ndarray, and that assumption is not true. (This is exactly the sort of inconsistency that led me to a rewrite—in Awkward0, some of the arrays are Awkward and some of them are NumPy. Overly strong assumptions didn't get caught in normal usage, but they are caught as we get into more complete usage of the Awkward model: IndexedMaskedArrays and ChunkedArrays).

jpivarski commented 4 years ago

Here's where it fails:

https://github.com/scikit-hep/awkward-array/blob/a2645fdaed1a6997c4677ae47cbb2cd0663e8a21/awkward/array/jagged.py#L1020-L1025

It's applying a ufunc and needs to broadcast one side of the binary operation to fit the other (or it's just going through the motions, if it already fits). The array it's looking at has a shape (I think I needed to add that to everything so that Pandas or Dask would be happy) and it has multiple, regularly sized dimensions (i.e. it is not 1D and does not use arbitrary-length sublists through JaggedArray). The array isn't a NumPy array, so there's no reshape.
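As a rough illustration of the broadcasting step described above, the NumPy-only sketch below mimics the `x.reshape(-1)[parents[good]]` pattern from the traceback. It is not the Awkward0 code: the `good` mask is left out because every entry here has a valid parent, and `NotNumpy` is a hypothetical stand-in for an array-like object that has a shape but no `reshape`.

```python
import numpy as np

# A jagged collection: 3 events with 2, 0 and 3 entries, values stored flat.
counts = np.array([2, 0, 3])
jagged_content = np.array([1.0, 2.0, 10.0, 20.0, 30.0])

# One value per event, to be combined with every entry of that event.
per_event = np.array([100.0, 200.0, 300.0])

# "parents" records which event each flat entry belongs to.
parents = np.repeat(np.arange(len(counts)), counts)

# Broadcasting: flatten the per-event array, then pick one value per entry.
broadcast = per_event.reshape(-1)[parents]
result = jagged_content + broadcast


class NotNumpy:
    """Hypothetical array-like: has a shape attribute but no reshape method."""
    shape = (3,)


# The same step fails as soon as the per-event side is not a NumPy array.
try:
    NotNumpy().reshape(-1)
    failed = False
except AttributeError:
    failed = True

print(parents.tolist(), result.tolist(), failed)
```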

Maybe the quick fix is to give all Awkward0 arrays a reshape operation. On any type that allows a non-1D shape, reshape(-1) would just apply reshape(-1) to its starts, stops, index, mask, or other structure array. As a very quick fix, we could just put in an if-statement to catch your array type.
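A minimal sketch of that idea, assuming a toy index-based option array: `ToyIndexedOption` is hypothetical and is not an Awkward0 class. Its `reshape` only reshapes the structure array (the index) and leaves the content alone.

```python
import numpy as np


class ToyIndexedOption:
    """Hypothetical stand-in for an index-based option array (not Awkward0)."""

    def __init__(self, index, content):
        self.index = np.asarray(index)      # structure array; -1 marks a missing entry
        self.content = np.asarray(content)  # values for the non-missing entries only

    @property
    def shape(self):
        return self.index.shape

    def reshape(self, *shape):
        # Delegate to the structure array; the content is shared, not copied.
        return ToyIndexedOption(self.index.reshape(*shape), self.content)

    def tolist(self):
        # Only meaningful for a 1D index, to keep the sketch short.
        return [None if i < 0 else self.content[i].item() for i in self.index]


regular = ToyIndexedOption([[0, -1], [1, 2]], [10.0, 20.0, 30.0])  # shape (2, 2)
flat = regular.reshape(-1)                                          # shape (4,)
print(flat.shape, flat.tolist())
```

With a method like this in place, a call such as `x.reshape(-1)` no longer depends on `x` being a NumPy array.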

An even quicker fix might be to avoid regular-sized dimensions in your analysis and use JaggedArrays instead. I think structure1d() does that.

I'm looking for quick fixes here because Awkward1 is almost ready. It will be glued into Uproot in April, after which it will be the primary version for many users. (Awkward1 doesn't have this issue because it uses a new RegularArray node type to represent rectilinear dimensions for non-NumPy content.)

mverzett commented 4 years ago

@jpivarski quick fixes are good for me! :) Can you make an example of structure1d() usage?

jpivarski commented 4 years ago

(If I remember right!) It's a zero-argument method you can call on any array, or maybe just JaggedArrays. It returns a version of the same data with regular dimensions replaced by JaggedArrays.
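To show what "regular dimensions replaced by JaggedArrays" could mean, here is a NumPy-only sketch that rewrites a regular 2D array as flat content plus starts and stops, which is how a jagged array describes its sublists. It is an illustration under that assumption, not the implementation of structure1d().

```python
import numpy as np

# A regular array: 3 rows of exactly 2 values each.
regular = np.array([[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]])

n_rows, n_cols = regular.shape
content = regular.reshape(-1)                          # all values, flattened
offsets = np.arange(0, (n_rows + 1) * n_cols, n_cols)  # row boundaries in content
starts, stops = offsets[:-1], offsets[1:]              # per-row begin and end

# Reading the jagged description back gives the same rows as before.
sublists = [content[a:b].tolist() for a, b in zip(starts, stops)]
print(offsets.tolist())
print(sublists)
```

The rows happen to have equal length here, but nothing in the (starts, stops, content) description requires it, which is why the regular-dimension assumption no longer comes into play.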

mverzett commented 4 years ago

Indeed it works, thanks!