Open oliviermattelaer opened 3 years ago
Hi Olivier, thanks for opening this, it will be useful to collect some studies.
I do not remember actively changing from 3-momenta to 4-momenta myself, but maybe I am wrong; I think that the present code is aligned with the original MadGraph.
In any case I agree that there is a trade-off to be made. Passing 3-momenta is probably heavier in recomputations and registers on the GPU. Passing 4-momenta, on the other hand, is heavier on copies: at the moment, and for eemumu, the copy of rambo outputs to the CPU is especially heavy. As discussed elsewhere (eg #22) the relative importance of this will decrease when we go to more complex processes, and it will also decrease when we do a realistic event unweighting on the GPU, so that we do not copy all events to the CPU, but only those which passed the hit-or-miss criteria.
It is likely that you branched out before some of my latest changes. But for this one it is indeed not that relevant (and indeed its importance will decrease further in the future).
On the other hand, I have kept the "low memory mode" for the color-matrix computation, even if in terms of performance the results were not that great.
So, looking at the code:
1) we have an issue with the ixxxxx routine that we need to fix (inconsistent handling of 4-momenta vs 3-momenta, at least in epoch2);
2) the code sometimes reads the full 4-momentum and sometimes only part of it (the 3-momentum or even less).
Hi @oliviermattelaer I am still reviewing old tickets.
I am not sure if this one is still relevant. I assume this is about the XXX routines rather than the FFV routines, right? Please note that a few months ago I went through all the xxx routines: I cross-checked that the simpler versions (imz/ipz/ixz) agree with the full versions (ixx), I added some tests with a reference file, and I also tried to check that the C++ versions agree with the Fortran.
A couple of more specific comments:
the latest code is now hardcoded in code generation here https://github.com/madgraph5/madgraph4gpu/blob/golden_epochX4/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/aloha/template_files/gpu/helas.cu
I added comments there that should be self-explanatory about a few changes I made and why, a few doubts I had, and what I understood the assumptions behind the simplified versions to be:
the pz/mz/xz versions always assume fmass=0 (and have no fmass input argument in the signature)
the pz/mz versions also assume pT=0, so they use only one component pz (not px, py or E)
the xz versions assume pT>0; while there are only 3 free components out of 4 (as mass=0), all 4 are used in the calculation, which is probably easier than recomputing the fourth one as a sqrt anyway! Note that I added what I thought was a bug fix, https://github.com/madgraph5/madgraph4gpu/blob/e41c14202e631bb87e1f514d7d116252dfc1dac4/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/aloha/template_files/gpu/helas.cu#L314-L317 (and also checked that it gives the same results as ixxxxx)
the full ixxxxx version with mass not 0 uses all 4 components; note that I did another bug fix as the code looked different from the Fortran, https://github.com/madgraph5/madgraph4gpu/blob/e41c14202e631bb87e1f514d7d116252dfc1dac4/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/aloha/template_files/gpu/helas.cu#L73-L74 (here indeed it was using 3 components instead of 4)... anyway avoiding a sqrt is probably a good idea
note that sxxxxx has two unused parameters (mass and helicity); maybe it would be easier to remove them as in the Fortran (but sxxxxx is not used in our eemumu/ggttgg examples, so I have no idea how to test that)
Finally, this issue #58 that you opened may well be a duplicate of my question #200 ? The latter is about the question I asked above, the ixxxxx implementation in https://github.com/madgraph5/madgraph4gpu/blob/e41c14202e631bb87e1f514d7d116252dfc1dac4/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/aloha/template_files/gpu/helas.cu#L73-L74
I will keep this open till clarified (I guess I could close #200 as a duplicate, but I will keep that open as well).
This is something that it would be great to revisit later (and therefore keep this open). The question is: should we pass E,px,py,pz,m as input, or just px,py,pz,m? Going for the second reduces the amount of memory to transfer from CPU to GPU, but increases the amount of work that the GPU has to do.
Just want to create an issue to track some development that we did in the past.
At some point during the hackathon, we tried to pass a 3-vector instead of a 4-vector (in addition to the mass value). The upside is clear (a 25% reduction of the memory footprint); the downside is that we obviously need to recompute the energy component, and a square root is heavy in terms of registers/instructions.
Looks like Andrea's version moved back to the 4-vector transfer. We might want to revisit that idea and check the two methods again.