scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

High-level function for selecting collections of different jaggedness; "broadcast-slice"? #492

Open dildick opened 3 years ago

dildick commented 3 years ago

(@jpivarski Feel free to suggest a better issue title. I can also add more information as needed.)

I'm trying to analyze a data set that has N events, in which each event has M muons and G clusters of hits in a detector. The M muons can be matched to any of the G clusters (including multiple clusters). For example, assume an event with 7 clusters, numbered 0,...,6, with muon0 matched to cluster2 and cluster5, muon1 matched to cluster2 and cluster6, and the remaining clusters unmatched. The index array would look like this (assuming the first two events have no muons matched to clusters):

>>> sim_id_gem_cluster
<Array [[], [], [[2, 5], [6, 2]]] type='3 * var * var * int64'>

Muon properties such as pT, eta, and phi, as well as sim_id_gem_cluster, are zipped together:

    sim_muon = ak.zip({
        "pt" : tree["sim_pt"],        
        "eta" : tree["sim_eta"],    
        "phi" : tree["sim_phi"], 
        "sim_id_gem_cluster" : tree["sim_id_gem_cluster"]  
    })

After zipping together, the types are

sim_muon.pt -> type='1000 * var * var * int32'
sim_muon.sim_id_gem_cluster -> type='1000 * var * var * int64'

Similarly, cluster data is zipped together, e.g.

    gem_cluster = ak.zip( {
        "bx" : tree["gem_cluster_bx"],                
        "station" : tree["gem_cluster_station"]  
    })

With types

gem_cluster.bx -> type='1000 * var * int32'
gem_cluster.station -> type='1000 * var * int32'

The question is: how do I select the muons that are matched to at least one cluster satisfying the criteria gem_cluster.bx == 0 and gem_cluster.station == 1?

nsmith- commented 3 years ago

If I understand right, you need to resolve the cross-reference sim_id_gem_cluster into the appropriate entries in the gem_cluster array? This is a pretty common pattern in e.g. CMS NanoAOD (see Electron_JetIdx etc.). The challenge is that the cluster index is currently a local index with respect to the start of each event. What I have done for coffea NanoEvents is to convert the index to a global one by adding the event offsets (of the gem_cluster array) and then using an IndexedArray. For example,

import numpy as np
import awkward1 as ak

sim_muons = ak.zip(
    {
        "pt": ak.Array([[], [], [20.0, 30.0]]),
        "cluster_idx": ak.Array([[], [], [[2, 5], [6, 2]]]),
    }
)

gem_clusters = ak.zip(
    {
        "value": ak.Array(
            [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0]]
        ),
        "bx": ak.Array([[-1, 0], [1, 0], [0, 1, 2, -2, 1, 1, 1]]),
    }
)  # type "nevents * nclusters * whatever"
id_global = ak.flatten(
    sim_muons.cluster_idx + np.asarray(gem_clusters.layout.starts), axis=None
)
sim_muons["clusters"] = ak.Array(
    ak.layout.ListOffsetArray64(
        sim_muons.layout.offsets,
        ak.layout.ListOffsetArray64(
            sim_muons.layout.content.offsets,
            ak.flatten(gem_clusters)[id_global].layout,
        ),
    )
)
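
To make the global-index step concrete with this toy data (numbers worked out by hand from the arrays above): the per-event starts of gem_clusters are [0, 2, 4], so the local indices [[2, 5], [6, 2]] in the third event become global indices [[6, 9], [10, 6]] into the flattened cluster array.

np.asarray(gem_clusters.layout.starts)  # array([0, 2, 4])
id_global                               # <Array [6, 9, 10, 6] type='4 * int64'>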

Then you could compute a mask that selects, e.g., muons that have at least one cluster with bx == 0:

In [2]: ak.any(sim_muons.clusters.bx == 0, axis=1)
Out[2]: <Array [[], [], [False, False]] type='3 * var * bool'>

or select using whatever else is available in the cluster record.

nsmith- commented 3 years ago

In general, this sort of "build-up structure" operation is part of what the NanoEvents schema classes are trying to abstract, but it would be nice if awkward-array provided a few convenience functions to simplify this. For example, if we were in NumPy, we could use multi-axis fancy indexing:

import numpy as np

a = np.random.normal(size=12).reshape(3, 4)
idx = np.random.randint(0, 3, size=24).reshape(3, 4, 2)  # second-axis indices
idx0 = np.arange(3).repeat(8).reshape(3, 4, 2)  # make a first-axis indexer
a[idx0, idx]

One of the requirements there is that each item in the getitem tuple has the same shape. Perhaps a new type of multi-axis indexing with N ListArrays could be allowed: as long as they are all broadcast-compatible, they would do the same indexing operation as in NumPy, generalized to jagged axes.
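
For concreteness, here is a deterministic variant of that NumPy sketch (small made-up arrays, so the result is visible); a[idx0, idx] picks a[idx0[i, j, k], idx[i, j, k]] element by element:

import numpy as np

a = np.arange(12).reshape(3, 4) * 10                  # [[0, 10, 20, 30], [40, 50, 60, 70], [80, 90, 100, 110]]
idx = np.array([[[0, 1], [1, 2], [0, 3], [2, 2]],
                [[3, 0], [1, 1], [2, 0], [0, 0]],
                [[1, 3], [0, 2], [3, 3], [2, 1]]])    # shape (3, 4, 2), second-axis indices
idx0 = np.arange(3).repeat(8).reshape(3, 4, 2)        # shape (3, 4, 2), first-axis indices
a[idx0, idx]    # shape (3, 4, 2); row 0 is [[0, 10], [10, 20], [0, 30], [20, 20]]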

jpivarski commented 3 years ago

Thanks, @nsmith-! Yesterday, @dildick and I were talking about this on Gitter (https://gitter.im/Scikit-HEP/awkward-array). Since this is a common problem, we need a way to do it with high-level functions (i.e. without explicitly building layouts). I've come to the conclusion that an appropriate high-level function does not exist, and your solution involves layouts, too (although quite compact).

The big question for such a function is, "What part of the problem should it handle?" If it's hyper-specific, then it could only be used for this case. If it's totally general, it might not be obvious that it's a solution to this problem. Naming that function and determining what its arguments and return value should be is the subject of this issue.

For the record, I'm going to copy-paste my Gitter solution here, just so that the information isn't lost. (I don't think there's a way to cross-reference a GitHub issue with a time-slice in a Gitter stream.)


My other meeting just finished and I have a low-level solution for you. As a reminder, my versions of your arrays are

>>> sim_id_gem_cluster
<Array [[], [], [[2, 5], [6, 2]]] type='3 * var * var * int64'>
>>> gem_cluster
<Array [[], ... {id: 4}, {id: 5}, {id: 6}]] type='3 * var * {"id": int64}'>
>>> gem_cluster.tolist()
[[], [], [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]

The problem is that each muon in each event (2 levels of jaggedness) indexes a cluster in each event (1 level of jaggedness). To be able to slice gem_cluster with sim_id_gem_cluster, you need a full copy of each event's clusters for every muon. In other words, we need to make gem_cluster look like this:

[[], [], [[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
          [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]]

That's the part that unfortunately doesn't have a high-level function. It can be made by creating an index that repeats each event number as many times as that event has muons:

>>> index = np.repeat(np.arange(len(sim_id_gem_cluster)), ak.num(sim_id_gem_cluster))
>>> index
array([2, 2])

and then using that index to build a (low-level "layout") IndexedArray:

>>> indexedarray = ak.layout.IndexedArray64(ak.layout.Index64(index), gem_cluster.layout)
>>> ak.to_list(indexedarray)
[[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
 [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]

(If you look at that indexedarray, you'll see its internal structure; that's why I pass it through ak.to_list.)

Then we can build a doubly jagged array using parts of the sim_id_gem_cluster:

>>> new_gem_cluster_layout = ak.layout.ListOffsetArray64(sim_id_gem_cluster.layout.offsets, indexedarray)
>>> ak.to_list(new_gem_cluster_layout)
[[], [], [[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
          [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]]

And that's what we wanted to make, so wrap it up as a new (high-level) array:

>>> new_gem_cluster = ak.Array(new_gem_cluster_layout)
>>> new_gem_cluster
<Array [[], ... {id: 4}, {id: 5}, {id: 6}]]] type='3 * var * var * {"id": int64}'>

Now we can simply slice it.

>>> new_gem_cluster[sim_id_gem_cluster]
<Array [[], ... {id: 5}], [{id: 6}, {id: 2}]]] type='3 * var * var * {"id": int64}'>
>>> new_gem_cluster[sim_id_gem_cluster].tolist()
[[], [], [[{'id': 2}, {'id': 5}], [{'id': 6}, {'id': 2}]]]

This act of "copying" the whole set of clusters for each muon to pick from is not expensive: the IndexedArray is functioning as a set of pointers. We've really just built one group of pointers per muon, each group pointing to all of the clusters in the same event.
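
(To close the loop on the original question, here is a sketch that assumes the real gem_cluster records carry bx and station fields as in the first post, unlike the toy id-only records here. The doubly jagged comparison reduces to one boolean per muon, which can then be used to slice the muon collection.)

>>> matched = new_gem_cluster[sim_id_gem_cluster]
>>> good_muon = ak.any((matched.bx == 0) & (matched.station == 1), axis=2)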

Actually, instead of a StackOverflow post, could you make the above a feature request for the missing high-level function? You shouldn't have to build low-level layouts by hand to solve a problem like this. There ought to be some high-level function for making indexes in sim_id_gem_cluster match up with items in gem_cluster to make new_gem_cluster in one line. The hard part will be determining what a natural interface should be—how you would expect such a function to be named and parameterized. That's where I need to get input from you, because I don't know what would be the most "natural" or "guessable" for you.

For something completely different, here's how you could do it with Numba:

>>> import numba as nb
>>> @nb.jit
... def select(builder, gem_cluster, sim_id_gem_cluster):
...     for clusters, muons in zip(gem_cluster, sim_id_gem_cluster):
...         builder.begin_list()
...         for muon in muons:
...             builder.begin_list()
...             for i in muon:
...                 builder.append(clusters[i])
...             builder.end_list()
...         builder.end_list()
...     return builder
... 
>>> select(ak.ArrayBuilder(), gem_cluster, sim_id_gem_cluster).snapshot()
<Array [[], ... {id: 5}], [{id: 6}, {id: 2}]]] type='3 * var * var * {"id": int64}'>

You can drop the @nb.jit to make this run slowly but be more easily debuggable. With @nb.jit, it tries to compile the contents, which means that any type errors yield lots of output.

The ak.ArrayBuilder is a way to make structured arrays gradually, which is a better fit for for-loop-style code. Every begin_list must be accompanied by an end_list, and append puts each item at the right level of depth. An ak.ArrayBuilder is not an ak.Array; that's what the .snapshot() is for.
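
Here is a minimal standalone sketch of that protocol in pure Python (no Numba), using the same builder methods as in the jitted function above:

builder = ak.ArrayBuilder()
builder.begin_list()     # open a list for the first entry
builder.append(1)
builder.append(2)
builder.end_list()       # every begin_list is paired with an end_list
builder.begin_list()     # second entry: an empty list
builder.end_list()
array = builder.snapshot()   # <Array [[1, 2], []] type='2 * var * int64'>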

(For all but the most complex problems, I want it to be possible to do it with slices and also to do it with Numba, so that whichever is more natural for a given problem may be used. There are some limitations in Numba, such as "all indexing must be simple indexing: i must be an int." Thus, you're forced to have three nested for loops in this example. They're compiled, so they're fast, but verbose.)

jpivarski commented 3 years ago

> it would be nice if awkward-array made a few convenience functions to simplify this.

> a = np.random.normal(size=12).reshape(3,4)

Is this one #489?

dildick commented 3 years ago

@nsmith- Thanks. I might be able to use that as a temporary solution until a suitable function becomes available.

dildick commented 3 years ago

@nsmith- So I tried hooking up your solution in my script. I think something is wrong.

If I use this sim_muons array

sim_muons = ak.zip(
    {
        "pt": ak.Array([[], [], [20.0, 30.0, 40]]),
        "cluster_idx": ak.Array([[], [], [[2, 5], [6, 2], [0]]]),
    }
)

I get as output

ak.any(sim_muons.clusters.bx == 0, axis=1)
[[], [], [True, False]]

I was expecting

[[], [], [False, False, True]]

dildick commented 3 years ago

And the output remains the same when I add more muons that match to the same cluster:

sim_muons = ak.zip(
    {
        "pt": ak.Array([[], [], [20.0, 30.0, 40, 50, 60]]),
        "cluster_idx": ak.Array([[], [], [[2, 5], [6, 2], [0], [0], [0]]]),
    }
)

dildick commented 3 years ago

Changing to ak.any(sim_muons.clusters.bx == 0, axis=2) seems to do the trick. It now prints

[[], [], [False, False, True, True, True]]

@nsmith- Another question I'm struggling with: how would this work if each muon has K index arrays, with each index array matching to a different kind of object? It seems I can only add a single index array. When I add a second one, I get an error that originates from the line with the new index array.

  File "/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/py2-awkward1/0.3.1/lib/python2.7/site-packages/awkward1/operations/structure.py", line 337, in zip
    out = awkward1._util.broadcast_and_apply(layouts, getfunction, behavior)
  File "/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/py2-awkward1/0.3.1/lib/python2.7/site-packages/awkward1/_util.py", line 820, in broadcast_and_apply
    out = apply(broadcast_pack(inputs, isscalar), 0)
  File "/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/py2-awkward1/0.3.1/lib/python2.7/site-packages/awkward1/_util.py", line 659, in apply
    outcontent = apply(nextinputs, depth + 1)
  File "/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/py2-awkward1/0.3.1/lib/python2.7/site-packages/awkward1/_util.py", line 708, in apply
    outcontent = apply(nextinputs, depth + 1)
  File "/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/py2-awkward1/0.3.1/lib/python2.7/site-packages/awkward1/_util.py", line 697, in apply
    nextinputs.append(x.broadcast_tooffsets64(offsets).content)

nsmith- commented 3 years ago

Hi, yeah, it should have been a reduction along axis=2; that was my mistake. For a new index array, you have to re-compute the global index with respect to the destination collection. I would suggest making a helper function like this:

def embed_crossref(source, idx_name, dest):
    """Embed a cross-reference

    Parameters
    ----------
        source : ak.Array
            any array with shape N * var * {record}
        idx_name : str
            A field in the source record
        dest : ak.Array
            any array with shape N * var * {record}, where:
            ``ak.max(source[idx_name], axis=1) < ak.num(dest)`` and
            ``ak.min(source[idx_name], axis=1) >= 0``
    """
    assert ak.all(ak.max(source[idx_name], axis=1) < ak.num(dest))
    assert ak.all(ak.min(source[idx_name], axis=1) >= 0)

    id_global = ak.flatten(
        source[idx_name] + np.asarray(dest.layout.starts), axis=None
    )
    return ak.Array(
        ak.layout.ListOffsetArray64(
            source.layout.offsets,
            ak.layout.ListOffsetArray64(
                source.layout.content.offsets,
                ak.flatten(dest)[id_global].layout,
            ),
        )
    )

sim_muons["clusters"] = embed_crossref(sim_muons, "cluster_idx", gem_clusters)

A small caveat: it's not really any array with the given shape; the source and destination are restricted to ListOffsetArray or ListArray layouts, which is usually what you get when reading fresh arrays from a ROOT file.
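
A sketch of a defensive check for that caveat (assuming the 64-bit layout classes used elsewhere in this thread): the helper reads .offsets from the source layout (at two levels) and .starts from the destination layout, so something like this could go at the top of embed_crossref:

assert isinstance(source.layout, ak.layout.ListOffsetArray64)          # .offsets is used
assert isinstance(source.layout.content, ak.layout.ListOffsetArray64)  # .content.offsets is used
assert isinstance(dest.layout, (ak.layout.ListOffsetArray64, ak.layout.ListArray64))  # .starts is used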

nsmith- commented 3 years ago

If you're interested, coffea NanoEvents lets you do things like the above, with the addition of lazy access, which means the destination collections of such cross-references are not read until you need them. You can create a custom Schema class deriving from https://github.com/CoffeaTeam/coffea/blob/master/coffea/nanoevents/schemas.py#L76 for your nTuple format once, and then skip boilerplate like this in subsequent work.

jpivarski commented 3 years ago

@nsmith-, getting back to the original request of slicing a jagged array using a doubly jagged array, is this something that NanoEvents can do? I'm not asking about completely general types, only about the case @dildick originally wanted. If so, I'd rather close this and leave it to NanoEvents.

(My list of tasks is a mile long!)

nsmith- commented 3 years ago

Well, I provided a reasonably general solution in the self-contained embed_crossref function above, and there is a lazy equivalent in NanoEvents; I'm not sure if that qualifies. I would leave it to @dildick to clarify whether these solutions are sufficient. Perhaps I could offer one takeaway: given that layout is semi-private, maybe there should be a public method to create new jagged arrays from counts or offsets, a la ak0's JaggedArray.from_counts or JaggedArray.from_offsets.
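
For illustration, a sketch of what such a helper amounts to today in awkward1, using the semi-private layout classes already shown in this thread (jagged_from_counts is a made-up name, not an existing function):

import numpy as np
import awkward1 as ak

def jagged_from_counts(counts, content):
    # build offsets [0, c0, c0+c1, ...] from per-entry counts, then wrap the
    # flat content in a ListOffsetArray64
    offsets = np.concatenate([[0], np.cumsum(np.asarray(counts))]).astype(np.int64)
    return ak.Array(
        ak.layout.ListOffsetArray64(
            ak.layout.Index64(offsets),
            ak.Array(content).layout,
        )
    )

jagged_from_counts([2, 0, 3], [1.1, 2.2, 3.3, 4.4, 5.5])
# <Array [[1.1, 2.2], [], [3.3, 4.4, 5.5]] type='3 * var * float64'>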

dildick commented 3 years ago

@nsmith-, so I tried your solution. Strangely enough, the procedure works outside an encapsulating function, but once it is in a separate function, it does not... I have been trying to figure out what happens. Also, when I include extra index arrays, e.g. for other matching objects, the broadcasting fails.

dildick commented 3 years ago

I made a slight change to the function, adding "dest_name", so that

def embed_crossref(source, idx_name, dest, dest_name):
    """Embed a cross-reference

    Parameters
    ----------
        source : ak.Array
            any array with shape N * var * {record}
        idx_name : str
            A field in the source record
        dest : ak.Array
            any array with shape N * var * {record}, where:
            ``ak.max(source[idx_name], axis=1) < ak.num(dest)`` and
            ``ak.min(source[idx_name], axis=1) >= 0``
        dest_name : str
            Name of the new field in ``source`` under which to store the embedded collection
    """

    assert ak.all(ak.max(source[idx_name], axis=1) < ak.num(dest))
    assert ak.all(ak.min(source[idx_name], axis=1) >= 0)

    id_global = ak.flatten(
        source[idx_name] + np.asarray(dest.layout.starts), axis=None
    )
    source[dest_name] = ak.Array(
        ak.layout.ListOffsetArray64(
            source.layout.offsets,
            ak.layout.ListOffsetArray64(
                source.layout.content.offsets,
                ak.flatten(dest)[id_global].layout,
            ),
        )
    )

so I call it with

embed_crossref(sim_muon, "sim_id_gem_cluster", gem_cluster, "clusters")

and that crashes with

Traceback (most recent call last):
  File "flatPlots.py", line 200, in <module>
    embed_crossref(sim_muon, "sim_id_gem_cluster", gem_cluster, "clusters")
  File "/uscms_data/d3/dildick/work/NewCSCTriggerPatterns/CMSSW_11_2_0_pre7/src/GEMCode/GEMValidation/scripts/objects.py", line 35, in embed_crossref
    assert ak.all(ak.max(source[idx_name], axis=1) < ak.num(dest))
AssertionError

nsmith- commented 3 years ago

Hm, for me it was OK; the assertions just guard against indices out of bounds (so maybe your nTuples have some bug?). I guess they could be simplified, though, to

    assert ak.all(source[idx_name] < ak.num(dest))
    assert ak.all(source[idx_name] >= 0)

or they could be removed, if you want to live dangerously (and possibly get incorrect results).

jpivarski commented 3 years ago

This has been on the back of my mind for a long time (a lot of things have, actually), and I think I have a solution (in the jpivarski/SliceVarNewAxis branch, which will be merged into jpivarski/fixes-for-scipy-prep, which will be merged into main—it was a long digression).

Regular dimensions of length 1 can now be mixed with variable-length dimensions in an array used as a slice. Regular dimensions of length 1 are created by np.newaxis or None, which is the NumPy idiom for the same behavior, except that we hadn't allowed it to mix with variable-length dimensions until now. (It's changing what used to be an error into a usable idiom.)

For example, if

>>> array = ak.Array([[[ 0,  1,  2,  3,  4],
...                    [ 5,  6,  7,  8,  9],
...                    [10, 11, 12, 13, 14]],
...                   [[15, 16, 17, 18, 19],
...                    [20, 21, 22, 23, 24],
...                    [25, 26, 27, 28, 29]]])
>>> array
<Array [[[0, 1, 2, 3, 4, ... 26, 27, 28, 29]]] type='2 * var * var * int64'>

is our array and

>>> slicer = ak.Array([[3, 4], [0, 1, 2, 3]])
>>> slicer
<Array [[3, 4], [0, 1, 2, 3]] type='2 * var * int64'>

is what we're going to use to slice it. Notice that the array is 2 * var * var * int64 and the slicer is 2 * var * int64; to make them compatible, we need to add a dimension to the slicer:

>>> slicer[:, np.newaxis]
<Array [[[3, 4]], [[0, 1, 2, 3]]] type='2 * 1 * var * int64'>

Now the outermost 2 matches and the innermost var matches, but we have a 1 in slicer where we have var in array. This is now legal:

>>> array[slicer[:, np.newaxis]]
<Array [[[3, 4], [8, 9, ... [25, 26, 27, 28]]] type='2 * var * var * int64'>
>>> array[slicer[:, np.newaxis]].tolist()
[[[3, 4],
  [8, 9],
  [13, 14]],
 [[15, 16, 17, 18],
  [20, 21, 22, 23],
  [25, 26, 27, 28]]]

The same applies for boolean arrays:

>>> slicer = ak.Array([[False, False, False,  True,  True],
...                    [ True,  True,  True,  True, False]])
>>> array[slicer[:, np.newaxis]]
<Array [[[3, 4], [8, 9, ... [25, 26, 27, 28]]] type='2 * var * var * int64'>
>>> array[slicer[:, np.newaxis]].tolist()
[[[3, 4],
  [8, 9],
  [13, 14]],
 [[15, 16, 17, 18],
  [20, 21, 22, 23],
  [25, 26, 27, 28]]]

I think this meets the need you described above. Anyway, I'll close the issue now; if you disagree, you can reply and I'll hear it, or open a new issue or Discussion. Thanks!

dildick commented 3 years ago

Thanks Jim! We will give it a try.

jpivarski commented 2 years ago

See discussion #1027; this feature was ill-conceived and led to unexpected interactions. I'll be removing it in awkward>1.4.0; these slices are still possible, but we'll have to explicitly set up the slicer to fit the array to be sliced, rather than relying on regular length-1 dimensions to perform this implicitly. Maybe new functions will be needed, which is what you suggested when you first opened the issue.
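
For reference, a sketch of what "explicitly setting up the slicer" could look like for the example above (awkward 1.x layout API, mirroring the IndexedArray trick earlier in this thread): repeat each event's slicer once per inner list, then slice.

import numpy as np
import awkward1 as ak

array = ak.Array([[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
                  [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]]])
slicer = ak.Array([[3, 4], [0, 1, 2, 3]])

# repeat each event's slicer once per inner list of `array` (pointers, not copies)
index = np.repeat(np.arange(len(slicer)), np.asarray(ak.num(array))).astype(np.int64)
fitted = ak.Array(
    ak.layout.ListOffsetArray64(
        array.layout.offsets,
        ak.layout.IndexedArray64(ak.layout.Index64(index), slicer.layout),
    )
)
array[fitted].tolist()
# [[[3, 4], [8, 9], [13, 14]],
#  [[15, 16, 17, 18], [20, 21, 22, 23], [25, 26, 27, 28]]]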