Cross section mismatch in pp_tt012j (P2_gu_ttxgu) in CI tmad tests - reset_cumulative_variable was called twice in fortran and once in cudacpp

There is a cross section mismatch in pp_tt012j in CI tmad tests.

This only appears after fixing the rotxxx crash (which otherwise hides it). I have seen this in the CI for PR #857, where volatile is added to fix rotxxx. https://github.com/madgraph5/madgraph4gpu/actions/runs/9694817881/job/26753418805

*** (2-none) Compare MADEVENT_CPP xQUICK xsec to MADEVENT_FORTRAN xsec ***

ERROR! xsec from fortran (0.82047507505698292) and cpp (0.67821465194602271) differ by more than 3E-13 (0.1733878669026967)

I forgot to mention: the CI now runs tmad tests in each P* subprocess. The issue appears in P2_gu_ttxgu. (Other subprocesses succeed).

I am having a look at this.

I confirm that the CI fails in P2_gu_ttxgu. This is a recent run in PR #934 where the pptt012j tests are the only pending issues https://github.com/madgraph5/madgraph4gpu/actions/runs/10045773913/job/27763863225?pr=934


*******************************************************************************
*** tmad_test pp_tt012j.mad (P2_gu_ttxgu)
*******************************************************************************

Testing in /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/pp_tt012j.mad/SubProcesses/P2_gu_ttxgu

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 32
32/64
64
 [XSECTION] VECSIZE_USED = 32
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 0.8205 [0.82047507505698292] fbridge_mode=0
 [UNWEIGHT] Wrote 5 events (found 37 events)
 [COUNTERS] PROGRAM TOTAL          :    0.0631s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.0497s
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0134s for      128 events => throughput is 9.54E+03 events/s

*** (1) EXECUTE MADEVENT_FORTRAN xQUICK (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 32
32/64
64
 [XSECTION] VECSIZE_USED = 32
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 0.8205 [0.82047507505698292] fbridge_mode=0
 [UNWEIGHT] Wrote 7 events (found 36 events)
 [COUNTERS] PROGRAM TOTAL          :    0.0660s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.0514s
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0145s for      128 events => throughput is 8.81E+03 events/s

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 32/64
 [XSECTION] VECSIZE_USED = 32
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 0.6782 [0.67821465194602271] fbridge_mode=1
 [UNWEIGHT] Wrote 4 events (found 26 events)
 [COUNTERS] PROGRAM TOTAL          :    0.0589s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.0519s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0070s for       96 events => throughput is 1.38E+04 events/s

*** (2-none) Compare MADEVENT_CPP xQUICK xsec to MADEVENT_FORTRAN xsec ***

ERROR! xsec from fortran (0.82047507505698292) and cpp (0.67821465194602271) differ by more than 3E-13 (0.1733878669026967)

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tmad_test (pp_tt012j.mad) finished with status=1 (NOT OK) at Mon Jul 22 18:02:30 UTC 2024
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Error: Process completed with exit code 1.

Hi @valassi , very good, I also started looking into it this morning, but please go ahead.

FYI I just noticed a small issue with the CI scripts, not sure why it appears only now? I.e. when I have nvcc in my PATH then, when running the CI "by hand" from the command line, it stops working at some stage. IIRC at the tput_test step, because it looks for a cuda build which hasn't been done. I believe I notice this the first time. Taking out the nvcc location from the PATH then it builds and runs fine of course (for CPU).

FYI I just noticed a small issue with the CI scripts, not sure why it appears only now? I.e. when I have nvcc in my PATH then, when running the CI "by hand" from the command line, it stops working at some stage. IIRC at the tput_test step, because it looks for a cuda build which hasn't been done. I believe I notice this the first time. Taking out the nvcc location from the PATH then it builds and runs fine of course (for CPU).

I made some changes to some scripts to exclude ggttggg builds on HIP due to #933. But this should only appear in my own branches eg PR #934 (and I think I did not even push it yet). So I do not understand what you refer to. If you have a concrete example and reproducer please open a ticket.

I am investigating this in WIP PR #935

One part of the problems seems to be that "gu_ttxgu" code is generated differently when as an individual process and when part of a multileg process

within multileg pp_tt012j the cross section comparison fails (this bug #872 )
in a standalone gu_ttgu or gq_ttgq (with an dwithout antiquarks) the cross section is identical I will also try a fixed-leg pp_ttjj...

These are the snippets

    gu_ttgu) # debug #872 in pp_tt012j
      cmd="generate g u > t t~ g u"
      ;;
    gq_ttgq) # debug #872 in pp_tt012j
      ###cmd="define q = u c d s u~ c~ d~ s~; generate g q > t t~ g q"
      cmd="define q = u c d s; generate g q > t t~ g q"
      ;;
    pp_tt012j)
      cmd="define j = p
      generate p p > t t~ @0
      add process p p > t t~ j @1
      add process p p > t t~ j j @2"
      ;;

I made some changes to some scripts to exclude ggttggg builds on HIP due to #933. But this should only appear in my own branches eg PR #934 (and I think I did not even push it yet). So I do not understand what you refer to. If you have a concrete example and reproducer please open a ticket.

I have opened just a new #936 with some observations, nothing of those is urgent but still I think worth having a look.

I have added manual scripts for pp_tt012j and this fails as expected. Strangely however this does not fail with 8192 events, fails with 10x times that.

I have also added pp_ttjj and this is very intresting. It also fails like pp_tt012j, but in a different way: pptt012j and ppttjj have the same code but result in different cross sections?? The snippet is here

    pp_ttjj)
      cmd="define j = p;
      generate p p > t t~ j j"
      ;;
    pp_tt012j)
      cmd="define j = p
      generate p p > t t~ @0
      add process p p > t t~ j @1
      add process p p > t t~ j j @2"
      ;;

My impression is that somehow the reproducibility event by event gets lost in these processes with subprocesses. What is particularly concerning is that the numbers of events processed in fortran does not seem to be what it should be. Then if a different number of events is processed in fortran anc cpp, it is quite obvious that cross sections are not bit by bit identical.

My feeling so far (I might be wrong) is that there is some issue that does not affect physics (?), but affects and is extremely bad for software testing. To be continued...

(PS well that it does not affect physics remains to be seen... I really do not understand why delegating an smatrix1 call to fortran instead of cpp should result in different numbers of events processed, all other things being equal including random numbers... I hope this is not again another difference in helicities...)

I designed the tmad scripts to execute the same numbers of events on Fortran and CUDA/C++. Either 8192 or 8192+10x8192 for longer tests.

This is the case for all processes, but not for the two pp processes. Hm. Is this a pdf effect in the tests?

\grep throughput tmad/logs_*_mad/log_*_mad_d_inl0_hrd0.txt | awk '{print $1, $11, $12}'

tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttgq_mad/log_gqttgq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_guttgu_mad/log_guttgu_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_pptt012j_mad/log_pptt012j_mad_d_inl0_hrd0.txt: 24576 events
tmad/logs_pptt012j_mad/log_pptt012j_mad_d_inl0_hrd0.txt: 24576 events
tmad/logs_pptt012j_mad/log_pptt012j_mad_d_inl0_hrd0.txt: 122880 events
tmad/logs_pptt012j_mad/log_pptt012j_mad_d_inl0_hrd0.txt: 24576 events
tmad/logs_pptt012j_mad/log_pptt012j_mad_d_inl0_hrd0.txt: 98304 events
tmad/logs_ppttjj_mad/log_ppttjj_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ppttjj_mad/log_ppttjj_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ppttjj_mad/log_ppttjj_mad_d_inl0_hrd0.txt: 98304 events
tmad/logs_ppttjj_mad/log_ppttjj_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_ppttjj_mad/log_ppttjj_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 8192 events
tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt: 90112 events

Specifically the problem comes from this snippet in https://github.com/madgraph5/madgraph4gpu/issues/872#issuecomment-2244838905

*** (1) EXECUTE MADEVENT_FORTRAN xQUICK (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 32
32/64
64
 [XSECTION] VECSIZE_USED = 32
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 0.8205 [0.82047507505698292] fbridge_mode=0
 [UNWEIGHT] Wrote 7 events (found 36 events)
 [COUNTERS] PROGRAM TOTAL          :    0.0660s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.0514s
 [COUNTERS] Fortran MEs      ( 1 ) :    0.0145s for      128 events => throughput is 8.81E+03 events/s

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 32/64
 [XSECTION] VECSIZE_USED = 32
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 0.6782 [0.67821465194602271] fbridge_mode=1
 [UNWEIGHT] Wrote 4 events (found 26 events)
 [COUNTERS] PROGRAM TOTAL          :    0.0589s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.0519s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0070s for       96 events => throughput is 1.38E+04 events/s

*** (2-none) Compare MADEVENT_CPP xQUICK xsec to MADEVENT_FORTRAN xsec ***

ERROR! xsec from fortran (0.82047507505698292) and cpp (0.67821465194602271) differ by more than 3E-13 (0.1733878669026967)

The cross sections are different because one comes from 96 events and one from 128. So the real question is why this happens.

I couldn't let loose ;-), here are some observations with I believe a possible path forward.

First thing I realised is that in the failing P* directories (there are multiple) the good helicities are being calculated twice, e.g.

xqcutij # 5>     0.0     0.0     0.0     0.0
 Added good helicity            2  0.76010036098291189       in event            1 local:           1
 Added good helicity            4   4.5619504515741260E-002  in event            1 local:           1
[...]
 Added good helicity           61   4.5641951429909823E-002  in event            1 local:           1
 Added good helicity           63  0.76040877832307874       in event            1 local:           1
 NGOODHEL =          32
 NCOMB =          64
 RESET CUMULATIVE VARIABLE
 MULTI_CHANNEL = TRUE
 CHANNEL_ID =           1
 RESET CUMULATIVE VARIABLE
 Added good helicity            2   2.3895932835526006       in event            1 local:           1
 Added good helicity            4  0.11145961963891361       in event            1 local:           1
 [...]
Added good helicity           61  0.11147986556814252       in event            1 local:           1
 Added good helicity           63   2.3895315191509190       in event            1 local:           1
 NGOODHEL =          32
 NCOMB =          64
 RESET CUMULATIVE VARIABLE
 RESET CUMULATIVE VARIABLE

then when running such a process through the debugger one can see that two different channels and confs are being used for the two respective blocks above, for the first one

Breakpoint 2, smatrix1 (p=..., rhel=0.13348287343978882, rcol=0.30484926700592041, channel=2, ivec=1, ans=1.4682766262148947e-05, ihel=0, icol=0) at matrix1.f:276
276                 PRINT *,'Added good helicity ',I,TS(I)*NCOMB/ANS,' in'
(gdb) bt
#0  smatrix1 (p=..., rhel=0.13348287343978882, rcol=0.30484926700592041, channel=2, ivec=1, ans=1.4682766262148947e-05, ihel=0, icol=0) at matrix1.f:276
#1  0x000000000042fda6 in smatrix1_multi (p_multi=..., hel_rand=..., col_rand=..., channel=2, out=..., selected_hel=..., selected_col=..., vecsize_used=32) at auto_dsig1.f:674
#2  0x0000000000431c10 in dsig1_vec (all_pp=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., all_wgt=..., imode=0, all_out=..., vecsize_used=32) at auto_dsig1.f:544
#3  0x0000000000432e77 in dsigproc_vec (all_p=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=2, iproc=1, imirror=<optimized out>, symconf=..., confsub=..., all_wgt=..., 
    imode=<optimized out>, all_out=..., vecsize_used=<optimized out>) at auto_dsig.f:1042
[...]

and for the second "block"

#0  smatrix1 (p=..., rhel=0.71700626611709595, rcol=0.96330016851425171, channel=1, ivec=1, ans=4.5223594439065952e-08, ihel=1, icol=4) at matrix1.f:276
#1  0x000000000042fda6 in smatrix1_multi (p_multi=..., hel_rand=..., col_rand=..., channel=1, out=..., selected_hel=..., selected_col=..., vecsize_used=32) at auto_dsig1.f:674
#2  0x0000000000431c10 in dsig1_vec (all_pp=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., all_wgt=..., imode=0, all_out=..., vecsize_used=32) at auto_dsig1.f:544
#3  0x0000000000432e77 in dsigproc_vec (all_p=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=<optimized out>, symconf=..., confsub=..., all_wgt=..., 
    imode=<optimized out>, all_out=..., vecsize_used=<optimized out>) at auto_dsig.f:1042
[...]

then I looked into disabling this double execution. I found a variable MIRRORPROCS

cat mirrorprocs.inc 
      DATA (MIRRORPROCS(I),I=1,1)/.TRUE./

setting this to FALSE will reduce the number of found/written events, but the fortran part will still use iconfig=2 in this case. Then tweaking the CI script to also make the cpp code use iconfig=2 will result

in the same number of events found and written
producing a wrong cross section, but the fortran one is exactly half the value of the cpp one.

So probably the modification with MIRRORPROCS is not the right one, but this makes me think we could be on the right path here

I also found several more processes with the same behavior as described above, i.e.

P2_gu_ttxgu 
P2_uux_ttxccx
P2_uux_ttxgg
P2_uux_ttxuux

I guess this could be a way forward, I suggest to check with @oliviermattelaer next week if one can disable this second execution (for testing purposes), he will know in 0 time ;-)

I found a variable MIRRORPROCS
cat mirrorprocs.inc 
      DATA (MIRRORPROCS(I),I=1,1)/.TRUE./

Yes I was looking at that. This is one part of it.

The problem is that cudacpp cannot handle mirror processes and anything with nprocesses>1, I had added huge sanity checks for this but they have been removed in PR #764 following PR #754 (where mirror processes were reenabled).

I have a patch in #935 which is undoing these changes from 754 and 764. I will ask Olivier for review and we can discuss there.

If one sets back MIRRORPROCS=false, which also leads to nprocesses=1, which enables adding back the static assert for nprocesses=1, the test succeeds. See the CI succeeding here https://github.com/madgraph5/madgraph4gpu/pull/935#issuecomment-2247943404

Closing this as fixed. Moving the discussion to PR #935

There is an option to the mg5 commands which will force the fortran generator to produce the same P directories/structure as the cuda one. Maybe this can help in this case.

Hi @roiser, @valassi,

This is certainly good finding (and highly non trivial stuff) This is likely something that I have to understood/investigate. Especially since if this need to be changed, this require to change python, fortran and cpp... (a nightmare ...)

To sum up the situation here:

In this case the two process are "u g > ....." and "g u > .... " Which need to same function to evaluate matrix-element. Personally I count this as nprocess=1 but looks like sometimes python count this as nprocess=2. This is actually the same and they are really nothing special to do on the cpp side to support this. (and the fortran side this is ready/handled correctly IF you have a small warp size --as usual--).

On the fortran side, it is important to know that they are two (since this is fortran that generate the phase-space so it needs to know when/if to flip the initial state momenta or not). While on the matrix-element side, only one of them is needed and this should not be a problem (in principle, clearly I'm wrong somewhere), so it does make sense that we do not have such variable on the C++ side (i.e. this is a phase-space related variable).

Concerning #935, I can take a look but this will not be fixed by "just" reverting something in the plugin. (and this is a case where having the test succeed is not good enough for me).

Now what puzzle me, is why (in #921) I do not see a missmatch between the SIMD version and the LTS version. (That issue is more some note to myself than anything else (I did that on last friday)). One point that I want to test in that issue is the comparison of fortran versus LTS to check if I observe a missmatch there or not. If not then the issue might also be internal to the test.

Is it possible that we have a different handling of the random number between fortran and SIMD but not "physical" missmatch? @valassi, what is the "input" file that you use here? Could we try to increase that number to "realistic" value to see if the missmatch of cross-section is still there? (only suggestion here) I wonder if this is a statistical missmatch that we spot here. Another, related, point to check is the helicity and more precisely, why fortran has two list of helicity, (is the list identical -- should be right,--) and C++ only one.

Sorry to do only brainstorming here (maybe this will help you, but can also wait that I'm back obviously).

Cheers,

Olivier

On 24 Jul 2024, at 15:39, Stefan Roiser @.***> wrote:

There is an option to the mg5 commands which will force the fortran generator to produce the same P directories/structure as the cuda one. Maybe this can help in this case.

— Reply to this email directly, view it on GitHubhttps://github.com/madgraph5/madgraph4gpu/issues/872#issuecomment-2247964508, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH6535WOAJBIBNJOAGBZZRDZN6VCBAVCNFSM6AAAAABJ7U7S7GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBXHE3DINJQHA. You are receiving this because you were mentioned.Message ID: @.***>

Hi @oliviermattelaer

Another, related, point to check is the helicity and more precisely, why fortran has two list of helicity, (is the list identical -- should be right,--) and C++ only one.

In the first code snipped in https://github.com/madgraph5/madgraph4gpu/issues/872#issuecomment-2246965317 above I commented out […] most of the values but I can confirm that the helicity lists are the same for both runs (the values are different) in fortran.

Also, the random number seed is fixed to the same value in “randinit” for both cudacpp and fortran.

Cheers Stefan

Thanks a lot,

Then I guess we could "rename" this issue as "fortran computing twice the helicity for gu_ttxgu while cuda/cpp only once".

Conceptually, they are no real bug here (but we can endlessly discuss about semantic ...): Fortran code has a security "feature" to check helicity filter for each processes even for mirror. (and it was usefull in some rare case but mainly to go around some potential issue in the flitering algorithm).

So I think that this issue can go as low priority now, and the best is likely to fix such issue when we restructure the way helicity filtering is handle (but this is not on our todo list for the moment). At least, I would propose to wait our next meeting to see how we move forward for a short term solution on this.

Thanks a lot, Nice team effort,

Olivier

On 24 Jul 2024, at 22:29, Stefan Roiser @.***> wrote:

Hi @oliviermattelaerhttps://github.com/oliviermattelaer

Another, related, point to check is the helicity and more precisely, why fortran has two list of helicity, (is the list identical -- should be right,--) and C++ only one.

In the first code snipped in #872 (comment)https://github.com/madgraph5/madgraph4gpu/issues/872#issuecomment-2246965317 above I commented out […] most of the values but I can confirm that the helicity lists are the same for both runs (the values are different) in fortran.

Also, the random number seed is fixed to the same value in “randinit” for both cudacpp and fortran.

Cheers Stefan

— Reply to this email directly, view it on GitHubhttps://github.com/madgraph5/madgraph4gpu/issues/872#issuecomment-2248845016, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH6535VD2M2ZJS5SX2WHU4LZOAFEJAVCNFSM6AAAAABJ7U7S7GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBYHA2DKMBRGY. You are receiving this because you were mentioned.Message ID: @.***>

Thanks a lot,

Thanks to you Olivier.

Then I guess we could "rename" this issue as "fortran computing twice the helicity for gu_ttxgu while cuda/cpp only once".

NO.

First, this issue is about a crfoss section mismatch, and I would keep it named as it is. Second, and more importantly, I believe helicities are NOT the core issue here.

The issue is that fortran uses mirrorprocs=true for this case and cudacpp does not. Rephrasing: the issue is that fortran internally handles imirror=1,2, while cudacpp assumes nprocesses==1. (And there even used to be a static assert for that as a sanity check, but it was removed... now a line 'nprocesses=2' may percolate into cudacpp, but as the comment lines around it say, this is NOT USED. It is only used for sanity checks, and it was meant to check that this == 1).

Rephrasing again, the question to clarify is whether this is ok (cudacpp can live with nprocesses=1), or instead it is not ok (i.e. we need to implement nprocesses=2 in cudacpp). I put this as astandalone question in #940, hopefully this makes it clearer.

Conceptually, they are no real bug here (but we can endlessly discuss about semantic ...): Fortran code has a security "feature" to check helicity filter for each processes even for mirror. (and it was usefull in some rare case but mainly to go around some potential issue in the flitering algorithm).

Thanks. Yes helicities should also be understood. But my impression is that this is just a side effect of imirror (wild guess, if there is imirror=1 and imirror=2, helicities are computed for both?). Anyway, I think this is a secondary issue.

So I think that this issue can go as low priority now, and the best is likely to fix such issue when we restructure the way helicity filtering is handle (but this is not on our todo list for the moment).

NO, absolutely not.

I think this is really high priority to understand now. Reminder, we have cudacpp and fortran that give different cross sections from one madevent executable now. If we leave things as they are, ANY PHYSICS RESULTS IN PP COLLISIONS ARE WRONG.

Either we clarify that we can solve this by setting nprocesses=1 and mirrorprocs=false also in frortran because we already have both BOTH P2_gu_ttxgu AND P2_gux_ttxgux as SEPARATE directories, i.e. we merge PR #935, OR otherwise we need to implement nprocesses=2 in cudacpp. A placeholder for this is again #940.

At least, I would propose to wait our next meeting to see how we move forward for a short term solution on this.

Yes on this I completely agree, lets discuss on Tuesday. Thanks Andrea

I think this is really high priority to understand now. Reminder, we have cudacpp and fortran that give different cross sections from one madevent executable now. If we leave things as they are, ANY PHYSICS RESULTS IN PP COLLISIONS ARE WRONG.

I am wrong on this one. I made more tests and somehow it looks like the physics results are correct as-is now. That is to say, it seems correct to keep patch #754 (and probably all of #764)

(I still do not understand what is going on. The question about how cudacpp and fortran madevent treat nprocesses=2 and MIRRRORPROCS=TRUE, i.e. #940 or something similar, remains open... and especially the question why this test fails and how it can be fixed: I find it strange, and not desirable, that a madevent executable with the same inputs gives different results when using fortran or cudacpp).

Anyway, here are my tests.

I computed the pp->ttjj cross section under various scenarios.

(1) As a reference, I took a fortran only code generation output madevent. (One detail: I could not use anything derived from the gpucpp branch, because of bug https://github.com/mg5amcnlo/mg5amcnlo/issues/122... so instead I used tag v3.5.5). I got

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent pp_ttjj.madonly --hel_recycling=False --vector_size=32
launch
...
INFO:  Idle: 0,  Running: 0,  Completed: 23 [  2m 19s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 10m47s
...
     Cross-section :   421.1 +- 1.277 pb
     Nb of events :  10000

Ah note an important point: this one has 5 P* subdirectories, not 12. This is because mirror processes are handled in fewer subdirectories.

(2) I then used the plugin version output madevent_simd, ralated to the gpucpp branch. Ah note an important point: this one has 12 P* subdirectories, not 5. This is because mirror processes are in any case expanded in more subdirectories (but then @oliviermattelaer why does it ALSO need MIRRORPROCS=TRUE and nprocesses=2 inside???)

Initially I used the cudacpp that I wanted to merge in PR #935, i.e. CODEGEN as of https://github.com/madgraph5/madgraph4gpu/pull/935/commits/43673b5f1fc44b6820c2a9e74345eee8124c7642

This is the code with nprocesses=1 and MIRRORPROCS=FALSE always. It gives wrong cross sections (the same cross sections for fortran, cuda, cpp).

Cuda (nprocesses=1)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madFcud --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=cuda vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 28 [  3m 15s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 17m17s
...
     Cross-section :   342.1 +- 1.029 pb
     Nb of events :  10000

Fortran (nprocesses=1)... note that this is faster than cuda...?! (as reported by CMS, to be understtod: more events?)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madFfor --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=fortran, default vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 28 [  1m 41s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 10m08s
...
     Cross-section :   342.5 +- 0.9972 pb
     Nb of events :  10000

Cpp (nprocesses=1)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madFcpp --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=cpp512y, default vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 28 [  26.2s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 4m03s
...
     Cross-section :   342.5 +- 0.9966 pb
     Nb of events :  10000

(3) I then used the cudacpp essentially with CODEGEN from upstream/master, i.e. CODEGEN as of https://github.com/madgraph5/madgraph4gpu/pull/935/commits/b685196f5435ffa9800fc38be85527c7c7d76626

This is the code with nprocesses=2 and MIRRORPROCS=TRUE sometimes. It gives correct cross sections ie the same as v3.5.5 (and the same cross sections for fortran, cuda, cpp).

Fortran (nprocesses=2)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madTfor --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=fortran, default vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 28 [  1m 35s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 9m42s
...
     Cross-section :   422.1 +- 1.171 pb
     Nb of events :  10000

Cuda (nprocesses=2)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madTcud --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=cuda vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 34 [  3m 8s  ] 
INFO: Combining runs
...
sum of cpu time of last step: 17m09s
...
     Cross-section :   422.8 +- 1.107 pb
     Nb of events :  10000

Cpp (nprocesses=2)

set stdout_level DEBUG
set zerowidth_tchannel F
define j = p
generate p p > t t~ j j
output madevent_simd pp_ttjj.madTcpp --hel_recycling=False --vector_size=32 
launch
[2 edit runcard -> backend=cpp512y, default vectorsize=32]
...
INFO:  Idle: 0,  Running: 0,  Completed: 34 [  26.9s  ] 
INFO: Combining runs 
...
sum of cpu time of last step: 4m06s
...
     Cross-section :   422.7 +- 1.184 pb
     Nb of events :  10000

(SUMMARY)

Looks like

cudacpp gives the rights results and the same results as fortran for madevent_simd when nprocesses=2 and MIRRORPROCS=TRUE are allowed
but then in that case one individual madeven executable, starting from the same inputs, gives different results and processes different numbers of events... this causes this #872, and personally I think this should be fixed, with high priority, as otherwise it becomes very difficult to make unit tests
as for the explanation, maybe indeed the helicities are one part of the problem, as suggested by @roiser and @oliviermattelaer ... I will look at this tomorrow

In any case I would make whatever change is needed to ensure that one madevent executable gives the same results when starting from the same inputs, whether with cuda or fortran MEs

I am comparing "pp_tt" and "qq_tt" (specifically the uux_ttx therein) as these are simpler processes that are helpful to debug this.

The cross section mismatch happens in pp_tt but not in qq_tt. The two codes are identical except for

MIRRORPROCS=TRUE or FALSE in the two cases (also nprocesses=2 or 1, but this is irrelevant)
LMAXCONFIGS=3 or 1

One thing I noticed is that fortran (with "x10" events in tmad tests) prints out 4 times "RESET CUMULATIVE VARIABLE" (see https://github.com/madgraph5/madgraph4gpu/pull/486#issuecomment-1156071389), while cudacpp only 2. Maybe this i srelated to helicity filtering as was mentioned, maybe something else.

Then I guess we could "rename" this issue as "fortran computing twice the helicity for gu_ttxgu while cuda/cpp only once".

NO.

And I was wrong on this one too. Apologies, Olivier, and thanks again. (But, I would still not rename it).

Adding various debug printouts, Fortran gives in my scripts

*** (1) EXECUTE MADEVENT_FORTRAN x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! ICONFIG number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./madevent_fortran < /tmp/avalassi/input_pptt_x10_fortran > /tmp/avalassi/output_pptt_x10_fortran'
 IMIRROR =           2
 NGOODHEL =           8
 RESET CUMULATIVE VARIABLE in SMATRIX1
 RESET CUMULATIVE VARIABLE
 RESET CUMULATIVE VARIABLE in SAMPLE_FULL
 RESET CUMULATIVE VARIABLE
 IMIRROR =           1
 NGOODHEL =           8
 RESET CUMULATIVE VARIABLE in SMATRIX1
 RESET CUMULATIVE VARIABLE
 RESET CUMULATIVE VARIABLE in SAMPLE_FULL
 RESET CUMULATIVE VARIABLE
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 8
8/16
16

While cuda gives

*** (2-none) EXECUTE MADEVENT_CPP x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! ICONFIG number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_pptt_x10_cudacpp > /tmp/avalassi/output_pptt_x10_cudacpp'
 RESET CUMULATIVE VARIABLE in SMATRIX1_MULTI
 RESET CUMULATIVE VARIABLE
 NGOODHEL =           8
 RESET CUMULATIVE VARIABLE in SAMPLE_FULL
 RESET CUMULATIVE VARIABLE
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 8/16

Now, I was partly right, in the sense that this comes from the fact that nprocesses=2 is not used. While, if we want strict reproducibility between fortran and cuda/cpp (and I most definitely do want this - see the discussion with CMS this morning), then we need to compute the helicities once per imirror also in cudacpp (so the answer to #940, in my opinion, is: yes, we should implement helicities per imirror, so once for each of the 2 nprocesses).

The alternative is to disable this in fortran, ie compute it only once for one imirror. But I guess you'd rather avoid that?

Note by the way, IMIRROR is a common (sigh), COMMON/TO_MIRROR/ IMIRROR,IPROC.

I imagine (I may be wrong) that the reset cuulative variable also is relevant here and causes a different number of events to be processed, which in turn causes the different crss section. I will have a look what can be done.

Ok this seems to be doing the trick... much easier. In practice

I do NOT compute one helicity per mirror in cudacpp: I only compute it once for IMIRROR=1
HOWEVER, I do call RESET_CUMULATIVE_VARIABLE twice... this forces somehow the processing of the same number of events and fixes the test

This is the patch, I need to backport it

commit 7bf806ab505335c2fcfa28626a371a8d4bacdab3 (HEAD -> pptt)
Author: Andrea Valassi <andrea.valassi@cern.ch>
Date:   Fri Jul 26 16:48:18 2024 +0200

    [pptt] in pp_tt.mad P1_uux_ttx, first attempt to fix #872 by calling reset_cumualtive_variable twice

diff --git a/epochX/cudacpp/pp_tt.mad/SubProcesses/P1_uux_ttx/auto_dsig1.f b/epochX/cudacpp/pp_tt.mad/SubProcesses/P1_uux_ttx/auto_dsig1.f
index 0444957f7..6b76c3e29 100644
--- a/epochX/cudacpp/pp_tt.mad/SubProcesses/P1_uux_ttx/auto_dsig1.f
+++ b/epochX/cudacpp/pp_tt.mad/SubProcesses/P1_uux_ttx/auto_dsig1.f
@@ -565,9 +565,12 @@ C
       SAVE NWARNINGS
       DATA NWARNINGS/0/

-      LOGICAL FIRST
+      INTEGER IMIRROR, IPROC
+      COMMON/TO_MIRROR/IMIRROR, IPROC
+
+      LOGICAL FIRST(2)
       SAVE FIRST
-      DATA FIRST/.TRUE./
+      DATA FIRST/.TRUE., .TRUE./

       IF( FBRIDGE_MODE .LE. 0 ) THEN ! (FortranOnly=0 or BothQuiet=-1 or BothDebug=-2)
 #endif
@@ -596,12 +599,16 @@ C
           WRITE(6,*) 'ERROR! The cudacpp bridge only supports LIMHEL=0'
           STOP
         ENDIF
-        IF ( FIRST ) THEN ! exclude first pass (helicity filtering) from timers (#461)
-          CALL FBRIDGESEQUENCE_NOMULTICHANNEL( FBRIDGE_PBRIDGE, ! multi channel disabled for helicity filtering
-     &      P_MULTI, ALL_G, HEL_RAND, COL_RAND, OUT2,
-     &      SELECTED_HEL2, SELECTED_COL2 )
-          FIRST = .FALSE.
-c         ! This is a workaround for https://github.com/oliviermattelaer/mg5amc_test/issues/22 (see PR #486)
+        IF ( FIRST(IMIRROR) ) THEN ! exclude first pass (helicity filtering) from timers (#461)
+          FIRST(IMIRROR) = .FALSE.
+c         Compute helicities only for IMIRROR=1 in cudacpp (see #872)...
+          IF( IMIRROR.EQ.1 ) THEN
+            CALL FBRIDGESEQUENCE_NOMULTICHANNEL( FBRIDGE_PBRIDGE, ! multi channel disabled for helicity filtering
+     &        P_MULTI, ALL_G, HEL_RAND, COL_RAND, OUT2,
+     &        SELECTED_HEL2, SELECTED_COL2 )
+          ENDIF
+c         ... But do call reset_cumulative_variable also for IMIRROR=2 in cudacpp (see #872)
+c         This is a workaround for https://github.com/oliviermattelaer/mg5amc_test/issues/22 (see PR #486)
           IF( FBRIDGE_MODE .EQ. 1 ) THEN ! (CppOnly=1 : SMATRIX1 is not called at all)
             write(*,*) "RESET CUMULATIVE VARIABLE in SMATRIX1_MULTI"
             CALL RESET_CUMULATIVE_VARIABLE() ! mimic 'avoid bias of the initialization' within SMATRIX1
@@ -612,6 +619,7 @@ c         ! This is a workaround for https://github.com/oliviermattelaer/mg5amc_
      &        ' in total number of helicities', NTOTHEL, NCOMB
             STOP
           ENDIF
+          WRITE (6,*) 'IMIRROR =', IMIRROR
           WRITE (6,*) 'NGOODHEL =', NGOODHEL
           WRITE (6,*) 'NCOMB =', NCOMB
         ENDIF

I think this can be closed again. I added a patch in PR #935

function reset_cumulative_variable is called once per mirror
note that helicities are still called once in cudacpp, and twice in fortran (this can be improved, but let's assume there are no inconsistencies)

PS reset_cumulative_variable issues are related to #486 and https://github.com/oliviermattelaer/mg5amc_test/issues/22

I reopen this, just to mark the change of strategy

this will not be fixed by my #935
instead it will be fixed by Olivier's #955 (with minor extensions neded in #986)

madgraph5 / madgraph4gpu

Cross section mismatch in pp_tt012j (P2_gu_ttxgu) in CI tmad tests - reset_cumulative_variable was called twice in fortran and once in cudacpp #872