PFCandidate <-> element many-to-many association

jpata / particleflow

Machine-learned, GPU-accelerated particle flow reconstruction

Apache License 2.0

24 stars 29 forks source link

PFCandidate <-> element many-to-many association #1

Closed jpata closed 5 years ago

jpata commented 5 years ago

Here's an attempt to put private discussions on ML-PF in public.

For ML-PF studies, an idea was to exploit the fact that in the standard PF algo, PFBlock-> elements -> candidates, such that a small set of candidates is expected to produce a single PF candidate. Looking at the output of the algo, I generally find using PFCandidate::elementsInBlocks that the association can be many-to-many, in the sense that one element may be associated with multiple candidates, and one candidate with multiple elements.

Just as an example, here is a partial event from the debugging ntuplizer on a RelValTTbar_13 PU25ns_110X_upgrade2018_realistic_v3 file:

cmsRun test/step3.py
python test/ntuplizer.py ./data/ step3_AOD.root
root -l ./data/step3_AOD.root
root [2] pftree->Scan("clusters_npfcands:clusters_ipfcand0:clusters_ipfcand1:clusters_ipfcand2:clusters_ipfcand3")
***********************************************************************************
*    Row   * Instance * clusters_ * clusters_ * clusters_ * clusters_ * clusters_ *
***********************************************************************************
*        0 *        0 *         3 *       381 *       827 *       974 *         0 *
*        0 *        1 *         1 *      2125 *         0 *         0 *         0 *
*        0 *        2 *         1 *       814 *         0 *         0 *         0 *
*        0 *        3 *         1 *       449 *         0 *         0 *         0 *
*        0 *        4 *         3 *       365 *       626 *       709 *         0 *
*        0 *        5 *         3 *       625 *       637 *       767 *         0 *
*        0 *        6 *         2 *       120 *       484 *         0 *         0 *
*        0 *        7 *         2 *       965 *      1057 *         0 *         0 *
*        0 *        8 *         6 *       307 *       470 *       823 *       890 *

Will need to look in more detail to understand if this is really what PFAlgo is doing, or perhaps some artifact.

cc @jmduarte @lgray @vlimant @pierinim, feel free to include others.

jpata commented 5 years ago

So it seems the graphs produced by elementsInBlocks (PFCand <-> elem) are many to many. I'm not sure if this is a good ground truth, but perhaps better than nothing with respect to starting with the full event.

The full event:

The full event with element to candidate links:

A selection of 16 subgraphs from the event, we see that clusters are shared between candidates:

red: tracks
green: clusters
black: candidates

Subgraph size distribution, size=0 corresponding to a single track/cluster with no PF candidate, size=1 with a single track/cluster associated producing a single candidate, and so forth:

pierinim commented 5 years ago

Hello

Thanks Joseph. Let me take a step back, because I am missing the intermediate pieces here.

The problem seems similar to what is discussed in the context of hgcal these days (Jan et al are struggling with this since a while but I was told that there are progresses). I have the impression that the ground truth is not well defined with gen (which seems to be the ultimate goal) because you don’t know where to ‘cut it’ (eg before or after brem for electrons).

What you say (or some other variant of it, if people have ideas) is reasonable to me. In general, with a many to many problem I don’t know what fixes the output ‘ many’.

One could take the pfreco as truth (I think Javier mentioned this). But if you do so you need to keep maintaining the old code in any case. So this cannot be a solution.

I have the impression (and I think this is also your opinion because this is where you are pointing the finger) that the ultimate algorithm needs to take decisions (e.g. split / merge kind of decisions) that can certainly be turned into a classifier. But who decides to invoke the classifier?

I am wondering if this is not a use case for a RL agent to be trained on this problem. A source of inspiration could be the work by Casarsa et al. on jet trimming with RL. It’s a different conceptual problem but it has some commonality.

I don’t know if you guys were thinking in this direction. This might be overdoing.

What I had in mind was a way lower level approach, targeted Calojets-> pfjets ‘morphing’. So, in general there is no problem of merging/splitting in that case.

I think that we should have a discussion on this to understand where you guys stand and what was done. Then we can see where we could contribute. This was not in my plans but plans can be changed. I have three phd students starting in the next few months so one of them could put a considerable fraction of time on this.

Cheers

Maurizio

On 4 Oct 2019, at 06:31, Joosep Pata notifications@github.com wrote:

So it seems the graphs produced by elementsInBlocks (PFCand <-> elem) are many to many. I'm not sure if this is a good ground truth, but perhaps better than nothing with respect to starting with the full event.

The full event:

The full event with element to candidate links:

A selection of 16 subgraphs from the event, we see that clusters are shared between candidates:

red: tracks green: clusters black: candidates

Subgraph size distribution, size=0 corresponding to a single track/cluster with no PF candidate, size=1 with a single track/cluster associated producing a single candidate, and so forth:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

lgray commented 5 years ago

As an exercise, using PF reco as truth to see if you can reproduce the algorithm with good fidelity would be useful as a test of the model being appropriate. I think it's worth some thought.

For particle flow ground truth should already be more OK than for calorimetry. We don't care so much what happens inside the calorimeters so long as the clusters and tracks (eventually) match back to the same originating sim track. How things evolve in the calorimeter is less concerning than in the case of HGCal.

In the case of an edge classification network: A many-to-many association is possible if you have a many category output per edge and use the assignment probabilities from a softmax define the fractions. The loss function would be similar to the energy residual used in gravnet.

I need a bit more time to digest the graphs you show as examples.

pierinim commented 5 years ago

Hello

As an exercise, using PF reco as truth to see if you can reproduce the algorithm with good fidelity would be useful as a test of the model being appropriate. I think it's worth some thought.

I agree. The advantage of this is that we can train on data. The disadvantage, as said, is that one still needs the old-style code to be maintained and executed at least on the training-dataset definition step

For particle flow ground truth should already be more OK than for calorimetry. We don't care so much what happens inside the calorimeters so long as the clusters and tracks (eventually) match back to the same originating sim track. How things evolve in the calorimeter is less concerning than in the case of HGCal.

You have a point there. It might be indeed conceptually easier. basically, one would take as input the post-pythia list of stable particles.

In the case of an edge classification network: A many-to-many association is possible if you have a many category output per edge and use the assignment probabilities from a softmax define the fractions. The loss function would be similar to the energy residual used in gravnet.

mmhhh… need to think about this

I need a bit more time to digest the graphs you show as examples.

Maurizio

jmduarte commented 5 years ago

My first thought is just to let the many-to-many associations define the truth, and just connect all edges if the two pf elements (tracks or clusters) share any pf candidate.

Ideally, the network will then learn to predict all edges that connect elements that share any pf candidate.

Of course, then you might need some sophisticated post-processing to get the list of pf candidates out.

For example, here's what a graph of 100 clusters looks like, where blue = initially connecting those with dR<1 and red = truth based on if they share any pf candidate: graph_clusers_eta_clusters_phi

pierinim commented 5 years ago

I see…

I really think that we need a meeting among ourselves to understand the various options to try and how to share the work

On 4 Oct 2019, at 13:47, Javier Duarte notifications@github.com wrote:

My first thought is just to let the many-to-many associations define the truth, and just connect all edges if the two pf elements (tracks or clusters) share any pf candidate.

Ideally, the network will then learn to predict all edges that connect elements that share any pf candidate.

Of course, then you might need some sophisticated post-processing to get the list of pf candidates out.

For example, here's what a graph of 100 clusters looks like, where blue = initially connecting those with dR<1 and red = truth based on if they share any pf candidate: https://user-images.githubusercontent.com/4932543/66181807-1875d800-e627-11e9-8ec0-105850cdb78b.png — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jpata/particleflow/issues/1?email_source=notifications&email_token=AASCLFHHPQL4LWK2APGQTELQM3DGBA5CNFSM4I5GRYY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAKMEYA#issuecomment-538231392, or mute the thread https://github.com/notifications/unsubscribe-auth/AASCLFCKY63J3YSJJZRAEPTQM3DGBANCNFSM4I5GRYYQ.

lgray commented 5 years ago

@jmduarte Yes! A list of edges that can contain many PFCandidates is more or less a PFBlock. I was chatting with @jpata over emails that one of the first steps could be that we try to add an "afterburner" to PFBlockAlgo. Similar thought and creates a nice foothold within the reconstruction algorithm.

Having the "sophisticated post-processing" just be particle flow is a fine option in my view and creates a more continuous flow for the development. There are some nice execution timing benefits this could yield since each block is smaller and the combinatorics in PFAlgo would be smaller. Maybe some physics benefits too since it could also make a few pieces of the algo more stable.

lgray commented 5 years ago

@pierinim

The disadvantage, as said, is that one still needs the old-style code to be maintained and executed at least on the training-dataset definition step

That disadvantage is certainly noted in what I was saying. Given that we're probably not going to get rid of the original PFAlgo for a while we may as well make the most of it.

basically, one would take as input the post-pythia list of stable particles.

Since PF is presently very bad at reconstructing nuclear interaction, so the geant sim track list is also a reasonable candidate for "truth". It will be interesting to see how well we can do there and what sort of fits best. Clearly, if we can get all the way back to generator truth we should do it! :-)

lgray commented 5 years ago

@jpata How far does the tail go out in sub-graph size?

pierinim commented 5 years ago

Hello

This sounds to me as a very reasonable strategy to get to something concrete in a reasonably short amount of time. One could then think to get deeper in replacing previous steps of the pf algo. But I agree that we don’t need to go directly there because the pf algo as it is will stay with us for a while

Maurizio

On 4 Oct 2019, at 22:48, Lindsey Gray notifications@github.com wrote:

@pierinim

The disadvantage, as said, is that one still needs the old-style code to be maintained and executed at least on the training-dataset definition step

That disadvantage is certainly noted in what I was saying. Given that we're probably not going to get rid of the original PFAlgo for a while we may as well make the most of it.

basically, one would take as input the post-pythia list of stable particles.

Since PF is presently very bad at reconstructing nuclear interaction,s the geant sim track list is also a reasonable candidate for "truth". It will be interesting to see how well we can do there and what sort of fits best. Clearly, if we can get all the way back to generator truth we should do it! :-)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

jpata commented 5 years ago

Thanks for your thoughts! Let me try to reply one by one.

@pierinim:

One could take the pfreco as truth (I think Javier mentioned this). But if you do so you need to keep maintaining the old code in any case. So this cannot be a solution.

I agree with you and others here that using the existing PFAlgo as a learning target is a first step, likely it will need to be around for a while. Discussing with @vlimant some time ago the advantage of training on data also came up. Ultimately, the truth should be defined by gen-particles in a similar way as is done in e.g. HGCAL.

I am wondering if this is not a use case for a RL agent to be trained on this problem. A source of inspiration could be the work by Casarsa et al. on jet trimming with RL. It’s a different conceptual problem but it has some commonality.

RL didn't cross my mind so far, but thanks for the reference, let me read it.

What I had in mind was a way lower level approach, targeted Calojets-> pfjets ‘morphing’. So, in general there is no problem of merging/splitting in that case.

Sure, I'm interested particularly in targetting offline PF (ML or no ML), but if you demonstrate that such an approach works in scouting, all the better...

I have three phd students starting in the next few months so one of them could put a considerable fraction of time on this.

Sounds great if we can get someone at CERN interested in this!

@jmduarte

Of course, then you might need some sophisticated post-processing to get the list of pf candidates out.

I had come to this thought too, and turns out it's not too hard. Using networkx, we can filter the subgraphs and nodes in various ways, this function now takes care of it. So now we have ground truth "miniblocks" - the sets of elements that are connected to each other through PFCandidates. Usually (95% of miniblocks), they are of size 1-3, but some extend to size 100, I guess these could well be pathological. Ideally, the existing PFBlockAlgo could be tuned to produce similar miniblocks.

@lgray Have a look at slide 6 on the attached PDF, it's up to a 100, but a steeply falling spectrum. I'm optimistic that a simple cleaner of miniblocks of up to size 3 can take care of much of the reconstruction.

I collected some materials about what we currently have here: 2019_10_07_pf_datastructures.pdf. In particular, I think the ML problem can also be factorized into clustering and regression, which based on the existing PFAlgo output, could already be tackled separately.

Let's aim to chat by voice on Thursday as well.

I think this discusson issue of the many-to-many association could be considered closed, in the sense that we can use the PFCandidate::elementsInBlocks with postprocessing to produce reasonable ground truth in the form of miniblocks for a graph network.

lgray commented 5 years ago

@jpata Don't think you need networkx to start training a basic edge classifier. The truth is a one hot vector on the edge list. Can use networkx to filter the prediction after running the inference. However, all that's needed is the Fast-Union algorithm (already in CMSSW).

If we want DGCNN/gravnet variants we'll first need a network that doesn't require the number of clusters to be known apriori.

For feature learning, I think we'll want something where for each PFBlockElement type we define the low level features and explode to the same latent space dim between them. It'll need an input layer for each PFBlockElement type, but that's not a big deal.

pierinim commented 5 years ago

Hello Joosep

Could you put your thoughts on slides for Thursday

Jean-Roch and I had a mini-brainstorming yesterday and came out with 4 items that we would like to discuss as four possible common projects. I will try to put slides together on this

Cheers

Maurizio

On Oct 8, 2019, at 1:58 AM, Joosep Pata notifications@github.com wrote:

Thanks for your thoughts! Let me try to reply one by one.

@pierinim:

One could take the pfreco as truth (I think Javier mentioned this). But if you do so you need to keep maintaining the old code in any case. So this cannot be a solution.

I agree with you and others here that using the existing PFAlgo as a learning target is a first step, likely it will need to be around for a while. Discussing with @vlimant some time ago the advantage of training on data also came up. Ultimately, the truth should be defined by gen-particles in a similar way as is done in e.g. HGCAL.

I am wondering if this is not a use case for a RL agent to be trained on this problem. A source of inspiration could be the work by Casarsa et al. on jet trimming with RL. It’s a different conceptual problem but it has some commonality.

RL didn't cross my mind so far, but thanks for the reference, let me read it.

What I had in mind was a way lower level approach, targeted Calojets-> pfjets ‘morphing’. So, in general there is no problem of merging/splitting in that case.

Sure, I'm interested particularly in targetting offline PF (ML or no ML), but if you demonstrate that such an approach works in scouting, all the better...

I have three phd students starting in the next few months so one of them could put a considerable fraction of time on this.

Sounds great if we can get someone at CERN interested in this!

@jmduarte

Of course, then you might need some sophisticated post-processing to get the list of pf candidates out.

I had come to this thought too, and turns out it's not too hard. Using networkx, we can filter the subgraphs and nodes in various ways, this function now takes care of it. So now we have ground truth "miniblocks" - the sets of elements that are connected to each other through PFCandidates. Usually (95% of miniblocks), they are of size 1-3, but some extend to size 100, I guess these could well be pathological. Ideally, the existing PFBlockAlgo could be tuned to produce similar miniblocks.

@lgray Have a look at slide 6 on the attached PDF, it's up to a 100, but a steeply falling spectrum. I'm optimistic that a simple cleaner of miniblocks of up to size 3 can take care of much of the reconstruction.

I collected some materials about what we currently have here: 2019_10_07_pf_datastructures.pdf. In particular, I think the ML problem can also be factorized into clustering and regression, which based on the existing PFAlgo output, could already be tackled separately.

Let's aim to chat by voice on Thursday as well.

I think this discusson issue of the many-to-many association could be considered closed, in the sense that we can use the PFCandidate::elementsInBlocks with postprocessing to produce reasonable ground truth in the form of miniblocks for a graph network.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.