scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
806 stars 81 forks source link

`remove_diagonal` option for `ak.cartesian` and `ak.argcartestian` #3143

Open lgray opened 3 weeks ago

lgray commented 3 weeks ago

Description of new feature

It is a common operation in particle physics that we'll want a cartestian product with the diagonal removed. This occurs in "tag-and-probe" workflows used to measure physics object reconstruction efficiencies where we want to consider tagged objects also as probes to increase statistics and remove possible sources of population biases.

Here it is never the case that we want the diagonal of a cartesian product and right now it has to be removed by:

ij = ak.argcartesian([ele_for_tnp, ele_for_tnp])
is_not_diag = ij["0"] != ij["1"]
i, j = ak.unzip(ij[is_not_diag])
zcands = ak.zip({'tag': ele_for_tnp[i], 'probe': ele_for_tnp[j]})

Which costs tasks in dask-awkward and is generally distracting from physics intent of code. It would be better if, instead, we could:

ij = ak.argcartesian({"itag": ele_for_tnp, "jprobe": ele_for_tnp}, remove_diagonal=True)

and collapse all this to a single operation that also makes the intent more clear.

lgray commented 3 weeks ago

@ikrommyd

jpivarski commented 3 weeks ago

This is a good idea. (That's what the argument is named in ak.combinations and ak.argcombinations, right?) The only thing that's a little odd about it is that there's only a "diagonal" if the input collections are the same collection. Another way of looking at this is that it could be adding a "include lower triangular" in ak.combinations. Should remove_diagonal fail if the inputs don't have the same length per entry?

The best way to implement it is as a post-processing step, taking advantage of the fact that the output has a predictable shape per entry, like nested. Will @ikrommyd be implementing it? I can point out where it would go, how to do the removal in the code.

lgray commented 3 weeks ago

I @'d Iason because he's run into this a few times.

Perhaps a more natural place to put this is in ak.combinations(..., replacement=True) where there is always a diagonal that means what you think it means, and sometimes you want to drop it.

skip_self as a parameter name comes to mind.

ikrommyd commented 3 weeks ago

I'd say it sounds more natural to me as a part of ak.combinations. I say that because when I wanted this, I instantly went to the combinations doc page and not the cartesian doc page to see if it was implemented. The cartesian of a collection with itself to remove the diagonal is a workaround I think. Both remove_diagonal and skip_self in ak.combinations sound good to me and I wouldn't have a hard time understanding what they do. What I did originally was doing was doing combinations two times and concatenating which doesn't work that great with dask-awkward and global indexes. I didn't even think of the cartesian until Angus showed me so yeah I'd say combinations is the best place imo.

jpivarski commented 3 weeks ago

How would you feel about having an idiomatic construction, instead of a library feature?

Given

numbers = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

this makes the upper triangle of each entry (at axis=1):

tuples = ak.combinations(numbers, 2, axis=1)

The lower triangle can be constructed by swapping the order of fields in the tuple, through an ak.unzip, reverse order of the Python tuple, followed by ak.zip.

ak.zip(ak.unzip(tuples)[::-1])

Since we want both the lower triangle and the upper triangle, without the diagonal, we concatenate them at the same axis as was used in ak.combinations.

ak.concatenate([tuples, ak.zip(ak.unzip(tuples)[::-1])], axis=1)

A potential problem with this is the order, if that matters. The original pairs scan through the matrix as right-then-down (English reading order), but the lower triangle is constructed as down-then-right (Mongolian).


If that's a problem, then you can construct this from ak.argcartesian (must be "arg") with a filter:

tuples = ak.argcartesian([numbers, numbers])
no_diagonal = tuples[tuples["0"] != tuples["1"]]

And then the "arg" version can be applied to the original data to get the non-"arg":

ak.zip([numbers[no_diagonal["0"]], numbers[no_diagonal["1"]]])

The method through ak.argcartesian also generalizes beyond pairs (n=2) more readily, though Python loops or list comprehensions would be needed.

If we are to implement it in Awkward (I'm leaning more toward making it an argument of ak.combinations now), then we'd use a method like this to do it. Probably with ak.argcartesian because we'd need it to be general for it to go into a library.


In fact, @lgray, I know that you need ak.combinations for one of the 8 GPU tests, and ak.combinations is implemented with a kernel. ak.cartesian is not implemented with a kernel; it uses a trick @nsmith- came up with to implement it in terms of broadcasting. Broadcasting is complicated and involves some kernels, so I don't know if you'll run into trouble if you try replacing ak.combinations with

tuples = ak.argcartesian([numbers, numbers])
upper_triangle = tuples[tuples["1"] > tuples["0"]]
ak.zip([numbers[upper_triangle["0"]], numbers[upper_triangle["1"]]])
lgray commented 3 weeks ago

This produces more intermediates and is honestly a little confusing compared to having a parameters that clearly alters behavior in a specific way.

jpivarski commented 3 weeks ago

Re: intermediates—maybe you don't want to look at the way many of the functions are implemented...

But I agree that a function parameter is a lot more intuitive and discoverable than idioms like this. This remains an open feature request and I think I know which student's project it would fit under.

The implementation would very likely be the one based on ak.argcartesian, generalized to any n and implemented within ak.combinations/ak.argcombinations. From the perspective of array copies, you can count how many that would be, but from the perspective of Dask nodes, being implemented within a high-level function means that it wouldn't be any additional Dask nodes. We've come to learn that minimizing the number of Dask nodes is crucial.

ikrommyd commented 3 weeks ago

Well there is a problem with concatenating in dask-awkward which is why a separate function argument in ak.combinations would be good. If I concatenated on axis 0, then I had to also concatenate on axis 0 everything else that I wanted to keep the same length as my pairs. If I remember correctly, there was also an issue with global indexing there. If I concentrated on axis 1, that made my code 3 times slower and Martin said that this was because you shuffle a lot of data between partitions that way.