scikit-hep / fastjet

Jet-finding in the Scikit-HEP ecosystem.
https://fastjet.readthedocs.io
BSD 3-Clause "New" or "Revised" License
19 stars 11 forks source link

Problem with awkward array masking with exclusive_jets_constituent_index() #238

Open jbrewster7 opened 12 months ago

jbrewster7 commented 12 months ago

Hello, I am working with @kpachal, @mswiatlo, and @lgray on the development of Coffea and some analysis that involves the use of fastjet. We've been very happy with how smoothly everything using awkward arrays integrates with fastjet.

However, we have noticed a problem when running exclusive_jets_constituent_index() with a masked awkward array. If an awkward array has a boolean mask applied to it on axis 0 before clustering, exclusive_jets_constituent_index() returns indices that are out of range. An error is also thrown when trying to call exclusive_jets_constituents() since the out of range indices are being applied to the array.

Here is an example. Running this

import awkward as ak 
import fastjet

px = ak.Array([[-0.181, 0.377],
               [0.54, 0.253],
               [0.294, 4.65],
               [1.45, 12.8],
               [-3.55, -1.33]])

py = ak.Array([[-0.798, 0.116],
               [-0.65, -0.0342],
               [0.254, -1.88],
               [-0.179, -1.8],
               [-1.64, -1.03]])

pz = ak.Array([[-0.652, -0.0749],
               [-0.00527, -0.0731],
               [-0.259, -3.29],
               [-0.876, -7.18],
               [-0.0941, 0.147]])

E = ak.Array([[1.06, 0.425],
              [0.857, 0.3],
              [0.467, 6],
              [1.71, 14.8],
              [3.91, 1.7]])

mask = ak.Array([True,True,False,True,True])

eg = ak.zip(
    {
        'px': px,
        'py': py,
        'pz': pz,
        'E': E,
    },
)

jetdef = fastjet.JetDefinition(fastjet.kt_algorithm,1)

fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituent_index(njets=2)

returns

[[[1], [0]],
 [[0], [1]],
 [[1], [0, 2, 3]],
 [[0], [1]]]
---------------------------
type: 4 * var * var * int32

which contains out of range indices at index 2 on axis 0. If we then run

fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)

it throws the error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:950, in Array.__getitem__(self, where)
    949 with ak._errors.SlicingErrorContext(self, where):
--> 950     out = self._layout[where]
    951     if isinstance(out, ak.contents.NumpyArray):

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:510, in Content.__getitem__(self, where)
    509 def __getitem__(self, where):
--> 510     return self._getitem(where)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:563, in Content._getitem(self, where)
    562 elif isinstance(where, ak.highlevel.Array):
--> 563     return self._getitem(where.layout)
    565 # Convert between nplikes of different backends

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:638, in Content._getitem(self, where)
    637 elif isinstance(where, Content):
--> 638     return self._getitem((where,))
    640 elif is_sized_iterable(where):
    641     # Do we have an array

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:555, in Content._getitem(self, where)
    548 next = ak.contents.RegularArray(
    549     this,
    550     this.length,
    551     1,
    552     parameters=None,
    553 )
--> 555 out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
    557 if out.length is not unknown_length and out.length == 0:

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/regulararray.py:708, in RegularArray._getitem_next(self, head, tail, advanced)
    693 self._maybe_index_error(
    694     self._backend[
    695         "awkward_RegularArray_getitem_jagged_expand",
   (...)
    706     slicer=head,
    707 )
--> 708 down = self._content._getitem_next_jagged(
    709     multistarts, multistops, head._content, tail
    710 )
    712 return RegularArray(
    713     down, headlength, self._length, parameters=self._parameters
    714 )

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py:409, in ListOffsetArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    406 out = ak.contents.ListArray(
    407     self.starts, self.stops, self._content, parameters=self._parameters
    408 )
--> 409 return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:497, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    495 sliceoffsets = slicecontent._offsets
--> 497 outcontent = next_content._getitem_next_jagged(
    498     sliceoffsets[:-1], sliceoffsets[1:], slicecontent._content, tail
    499 )
    501 return ak.contents.ListOffsetArray(
    502     outoffsets, outcontent, parameters=self._parameters
    503 )

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/listarray.py:544, in ListArray._getitem_next_jagged(self, slicestarts, slicestops, slicecontent, tail)
    535 assert (
    536     outoffsets.nplike is self._backend.index_nplike
    537     and nextcarry.nplike is self._backend.index_nplike
   (...)
    542     and self._stops.nplike is self._backend.index_nplike
    543 )
--> 544 self._maybe_index_error(
    545     self._backend[
    546         "awkward_ListArray_getitem_jagged_apply",
    547         outoffsets.dtype.type,
    548         nextcarry.dtype.type,
    549         slicestarts.dtype.type,
    550         slicestops.dtype.type,
    551         sliceindex.dtype.type,
    552         self._starts.dtype.type,
    553         self._stops.dtype.type,
    554     ](
    555         outoffsets.data,
    556         nextcarry.data,
    557         slicestarts.data,
    558         slicestops.data,
    559         slicestarts.length,
    560         sliceindex.data,
    561         sliceindex.length,
    562         self._starts.data,
    563         self._stops.data,
    564         self._content.length,
    565     ),
    566     slicer=ak.contents.ListArray(slicestarts, slicestops, slicecontent),
    567 )
    568 nextcontent = self._content._carry(nextcarry, True)

File ~/anaconda3/lib/python3.10/site-packages/awkward/contents/content.py:271, in Content._maybe_index_error(self, error, slicer)
    270 message = self._backend.format_kernel_error(error)
--> 271 raise ak._errors.index_error(self, slicer, message)

IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets_constituents(njets=2)

File ~/anaconda3/lib/python3.10/site-packages/fastjet/_pyjet.py:147, in AwkwardClusterSequence.exclusive_jets_constituents(self, njets)
    146 def exclusive_jets_constituents(self, njets=10):
--> 147     return self._internalrep.exclusive_jets_constituents(njets)

File ~/anaconda3/lib/python3.10/site-packages/fastjet/_multievent.py:307, in _classmultievent.exclusive_jets_constituents(self, njets)
    305 duplicate = ak.unflatten(np.zeros(total, np.int64), shape)
    306 prepared = self.data[:, np.newaxis][duplicate]
--> 307 return prepared[outputs_to_inputs]

File ~/anaconda3/lib/python3.10/site-packages/awkward/highlevel.py:949, in Array.__getitem__(self, where)
    520 def __getitem__(self, where):
    521     """
    522     Args:
    523         where (many types supported; see below): Index of positions to
   (...)
    947     have the same dimension as the array being indexed.
    948     """
--> 949     with ak._errors.SlicingErrorContext(self, where):
    950         out = self._layout[where]
    951         if isinstance(out, ak.contents.NumpyArray):

File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:63, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     60 try:
     61     # Handle caught exception
     62     if exception_type is not None and self.primary() is self:
---> 63         self.handle_exception(exception_type, exception_value)
     64 finally:
     65     # `_kwargs` may hold cyclic references, that we really want to avoid
     66     # as this can lead to large buffers remaining in memory for longer than absolutely necessary
     67     # Let's just clear this, now.
     68     self._kwargs.clear()

File ~/anaconda3/lib/python3.10/site-packages/awkward/_errors.py:78, in ErrorContext.handle_exception(self, cls, exception)
     76     self.decorate_exception(cls, exception)
     77 else:
---> 78     raise self.decorate_exception(cls, exception)

IndexError: cannot slice ListArray (of length 8) with [[1], [0], [0], [1], [1], [0, 2, 3], [0], [1]]: index out of range while attempting to get index 3 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-17/awkward-cpp/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

This error occurred while attempting to slice

    <Array [[[{px: -0.181, ...}, ...], ...], ...] type='4 * var * var * {px...'>

with

    <Array [[[1], [0]], [[0], ...], ..., [[0], [1]]] type='4 * var * var * int32'>

due to trying to use the out of range indices.

This is not a problem if we just run the unmasked array

fastjet.ClusterSequence(eg, jetdef).exclusive_jets_constituent_index(njets=2)

which gives

[[[1], [0]],
 [[0], [1]],
 [[0], [1]],
 [[0], [1]],
 [[0], [1]]]
---------------------------
type: 5 * var * var * int32

Thank you for the help on this issue, it will be greatly appreciated!

jpivarski commented 12 months ago

I'm assuming that the error happens in eg[mask], not in fastjet.ClusterSequence or exclusive_jets_constituent_index, right?

Where does ak.num(eg, axis=1) not equal ak.num(mask, axis=1)?

Or if str(mask.type) has two "vars" just like str(eg.type), where does ak.num(eg, axis=2) not equal ak.num(mask, axis=2)?

That would be the first step. The next step is figuring out why they're not equal, because computing eg[mask] presupposes that they fit together.

I made the wrong assumption. You showed that the error definitely happens in exclusive_jets_constituents, but it looks like an Awkward slicing error, somewhere in its Python implementation.

kpachal commented 12 months ago

Hi @jpivarski, OK! So .... what is best for us to do here? Would you like us to open an issue in Awkward, or should we leave that to you? Since the working example we have relies on Fastjet, it's not entirely clear to us how to provide useful feedback to Awkward on this. Thoughts welcome!!

jpivarski commented 12 months ago

The crossed-out part was assuming that it's a bug in Awkward (in eg[mask]). It might still be, but it's also possible that exclusive_jets_constituents is using it wrong.

It's implemented here:

https://github.com/scikit-hep/fastjet/blob/0ede4d336453d327184bc7b3af1988e19e8f73bb/src/fastjet/_multievent.py#L298-L307

You've verified that exclusive_jets_constituent_index returns a value, so it should be possible to walk through each one of those steps. Do the results of the intermediate steps look right?

jbrewster7 commented 11 months ago

Hi @jpivarski, the error is thrown on line 307 when slicing prepared. The real issue seems to be happening before that in exclusive_jets_constituent_index since it is returning indices out of range. I tried to look into this a bit but all I could tell was that line 208 already had these out of range indices returned from to_numpy_exclusive_njet_with_constituents. I couldn't find a way to look any further into this.

jpivarski commented 11 months ago

So exclusive_jets_constituent_index returned a value, but it was the wrong value because there are indexes out of bounds. (It also means that there is no pure Python work-around, since you can't switch to using exclusive_jets_constituent_index instead of exclusive_jets_constituent.)

I followed exclusive_jets_constituent_index back to the C++ routine, which is to_numpy_exclusive_njet_with_constituents. The error must be in here:

https://github.com/scikit-hep/fastjet/blob/0ede4d336453d327184bc7b3af1988e19e8f73bb/src/_ext.cpp#L365-L436

which is unpacking the NumPy array inputs, constructing FastJet objects, running FastJet algoriths, and then packing the results into NumPy array outputs. Since the issue is that an index is off, it's probably not the translation between NumPy arrays and FastJet objects, but in handling the FastJet objects. Maybe an off-by-one error somewhere?

jbrewster7 commented 11 months ago

Hi, after looking into this a bit more I've realized that this is a problem with more than just exclusive_jets_constituent_index, though that is the function where the differences are apparent enough to cause noticeable problems. I believe the problem is most likely occurring in ClusterSequence. If I take the same example as I originally used, with the array eg, there are problems happening in the same place for the arrays returned from the functions exclusive_jets and inclusive_jets as well. This is noticeable when mask is applied before clustering vs not applied until afterwards.

>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
 [{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
 [{px: 4.65, py: -1.88, pz: -3.29, E: 6}, {px: 14.5, py: -1.73, ...}],
 [{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 4 * var * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)
[[{px: 0.377, py: 0.116, pz: -0.0749, E: 0.425}, {px: -0.181, ...}],
 [{px: 0.54, py: -0.65, pz: -0.00527, E: 0.857}, {px: 0.253, ...}],
 [{px: 0.294, py: 0.254, pz: -0.259, E: 0.467}, {px: 4.65, py: ..., ...}],
 [{px: 1.45, py: -0.179, pz: -0.876, E: 1.71}, {px: 12.8, py: -1.8, ...}],
 [{px: -3.55, py: -1.64, pz: -0.0941, E: 3.91}, {px: -1.33, py: ..., ...}]]
---------------------------------------------------------------------------
type: 5 * var * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> mask
[True,
 True,
 False,
 True,
 True]
--------------
type: 5 * bool

Specifically I noticed that fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2] should be equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2] (which is equivalent to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3]). However, we have

>>> fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2]
[{px: 4.65, py: -1.88, pz: -3.29, E: 6},
 {px: 14.5, py: -1.73, pz: -8.31, E: 17}]
-----------------------------------------
type: 2 * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]
>>> fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[mask][2]
[{px: 1.45, py: -0.179, pz: -0.876, E: 1.71},
 {px: 12.8, py: -1.8, pz: -7.18, E: 14.8}]
---------------------------------------------
type: 2 * Momentum4D[
    px: float64,
    py: float64,
    pz: float64,
    E: float64
]

Instead, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][0] is equal to fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][1].

Also, fastjet.ClusterSequence(eg[mask], jetdef).exclusive_jets(n_jets=2)[2][1] is equal (within a decimal place) to the sum of fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[2][0], fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][0], and fastjet.ClusterSequence(eg, jetdef).exclusive_jets(n_jets=2)[3][1].

Please let me know if any of this is unclear and thank you for all the help with this!

chrispap95 commented 11 months ago

You are absolutely right. The way we handle masked arrays is wrong. Running

px, py, pz, E, offsets = self.extract_cons(self.data)
print(offsets)

gives

array([ 0,  2,  4,  8, 10])

The correct layout is

>>> eg[mask].layout
<ListArray len='4'>
    <starts><Index dtype='int64' len='4'>
        [0 2 6 8]
    </Index></starts>
    <stops><Index dtype='int64' len='4'>
        [ 2  4  8 10]
    </Index></stops>
...

This explains why you see all these extra elements between offsets 4 and 8. I think we should implement this function with start & stop offsets instead of using only the stop as the offset. @jpivarski any suggestions would be very welcome. Do you think my proposal makes any sense?

jpivarski commented 11 months ago

It sounds like a good idea. (I haven't looked deeply into the details.)

lgray commented 10 months ago

@chrispap95 / @jpivarski any movement on this? It's blocking progress for future collider analysis, it would nice to have it sorted out.