madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

MAJOR ISSUE: color mismatch fortran/cpp in LHE file for iconfig 104 in SM gg_ttgg (channel/iconfig mapping AND icolamp issues) #856

Closed valassi closed 3 months ago

valassi commented 4 months ago

This is a followup to #855.

If I fix the SIGFPE in rotxxx by adding volatile in the Fortran code, the crash is avoided, but then there is an LHE file mismatch:

 ./tmad/madX.sh -ggttgg -iconfig 104
...
*** (2-none) Compare MADEVENT_CPP x1 xsec to MADEVENT_FORTRAN xsec ***

OK! xsec from fortran (0.46320556621222242) and cpp (0.46320556621222236) differ by less than 3E-14 (1.1102230246251565e-16)

*** (2-none) Compare MADEVENT_CPP x1 events.lhe to MADEVENT_FORTRAN events.lhe reference (including colors and helicities) ***
ERROR! events.lhe.cpp.1 and events.lhe.ref.1 differ!
diff /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/events.lhe.cpp.1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/events.lhe.ref.1 | head -20
6,8c6,8
<          -6    1    1    2    0  503  0.18965250326E+03 -0.37597274505E+02  0.12649008736E+03  0.28863535688E+03  0.17300000000E+03 0.  1.
<          21    1    1    2  504  501  0.62170885397E+02  0.36618395894E+02  0.31153079182E+02  0.78591604204E+02  0.00000000000E+00 0.  1.
<          21    1    1    2  505  504  0.17333851786E+01  0.11630357128E+03  0.45398068655E+02  0.12486196360E+03  0.00000000000E+00 0.  1.
---
>          -6    1    1    2    0  504  0.18965250326E+03 -0.37597274505E+02  0.12649008736E+03  0.28863535688E+03  0.17300000000E+03 0.  1.
>          21    1    1    2  504  503  0.62170885397E+02  0.36618395894E+02  0.31153079182E+02  0.78591604204E+02  0.00000000000E+00 0.  1.
>          21    1    1    2  505  501  0.17333851786E+01  0.11630357128E+03  0.45398068655E+02  0.12486196360E+03  0.00000000000E+00 0.  1.
20c20
<          21   -1    0    0  501  503 -0.00000000000E+00 -0.00000000000E+00 -0.12305922681E+04  0.12305922681E+04  0.00000000000E+00 0.  1.
---
>          21   -1    0    0  502  503 -0.00000000000E+00 -0.00000000000E+00 -0.12305922681E+04  0.12305922681E+04  0.00000000000E+00 0.  1.
22c22
<          -6    1    1    2    0  502 -0.16776755257E+03 -0.12342442113E+03 -0.43168412413E+03  0.50956817253E+03  0.17300000000E+03 0.  1.
---
>          -6    1    1    2    0  504 -0.16776755257E+03 -0.12342442113E+03 -0.43168412413E+03  0.50956817253E+03  0.17300000000E+03 0.  1.
24c24
<          21    1    1    2  505  504  0.14318120879E+02  0.15600982705E+02 -0.82469087380E+02  0.85144287067E+02  0.00000000000E+00 0. -1.
---
>          21    1    1    2  505  501  0.14318120879E+02  0.15600982705E+02 -0.82469087380E+02  0.85144287067E+02  0.00000000000E+00 0. -1.
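To make the diff above easier to read, here is a minimal sketch (not part of the madgraph4gpu test suite) that splits two LHE particle lines into their standard Les Houches fields and reports which fields differ. The helper names `parse_particle` and `differing_fields` are hypothetical, and the two sample lines are copied from the first hunk of the diff.

```python
# Field names of one LHE particle line, in the Les Houches Event file order.
LHE_FIELDS = [
    "IDUP", "ISTUP", "MOTHUP1", "MOTHUP2", "ICOLUP1", "ICOLUP2",
    "PX", "PY", "PZ", "E", "M", "VTIMUP", "SPINUP",
]


def parse_particle(line):
    """Split one LHE particle line into a {field name: token} dictionary."""
    tokens = line.split()
    if len(tokens) != len(LHE_FIELDS):
        raise ValueError(f"expected {len(LHE_FIELDS)} fields, got {len(tokens)}")
    return dict(zip(LHE_FIELDS, tokens))


def differing_fields(line_a, line_b):
    """Return the names of the fields whose text differs between two lines."""
    a = parse_particle(line_a)
    b = parse_particle(line_b)
    return [name for name in LHE_FIELDS if a[name] != b[name]]


# First '<' line (cpp) and first '>' line (fortran reference) of the diff.
cpp_line = (
    "-6    1    1    2    0  503  0.18965250326E+03 -0.37597274505E+02 "
    " 0.12649008736E+03  0.28863535688E+03  0.17300000000E+03 0.  1."
)
ref_line = (
    "-6    1    1    2    0  504  0.18965250326E+03 -0.37597274505E+02 "
    " 0.12649008736E+03  0.28863535688E+03  0.17300000000E+03 0.  1."
)

print(differing_fields(cpp_line, ref_line))  # ['ICOLUP2']
```

For this pair only the anticolor tag (ICOLUP2) differs: the particle id, mother indices, momenta, mass and spin are identical, which is what you would expect from a color-assignment problem rather than a phase-space or matrix-element one.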

I think that this is due to the issue identified by Olivier in WIP PR #852, namely channel/iconfig mapping issues.

But this is an INDEPENDENT issue from the SIGFPE crash (even if it is bizarre that both happen only for specific iconfig choices).

valassi commented 4 months ago

I mark this as a major issue and pin it, because this affects all SM results in user code. As pointed out by Olivier in PR #852, there is a problem in cudacpp with iconfig-channel mappings. IMO this must be fixed before the release.

valassi commented 3 months ago

This issue #856 (LHE color mismatch in gg_ttgg for iconfig=104) can now be reproduced in the CI if rotxxx is fixed. In PR #857, which fixes rotxxx with volatile, I configured the tmad test for gg_ttgg to use iconfig=104. I get the following failure: https://github.com/madgraph5/madgraph4gpu/actions/runs/9698902245/job/26766683190

*** (2-none) Compare MADEVENT_CPP xQUICK events.lhe to MADEVENT_FORTRAN events.lhe reference (including colors and helicities) ***
ERROR! events.lhe.none.QUICK and events.lhe.ref.QUICK differ!
diff /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/events.lhe.none.QUICK /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/events.lhe.ref.QUICK | head -20
4c4
<          21   -1    0    0  501  503 -0.00000000000E+00 -0.00000000000E+00 -0.33445071051E+03  0.33445071051E+03  0.00000000000E+00 0.  1.
---
>          21   -1    0    0  502  503 -0.00000000000E+00 -0.00000000000E+00 -0.33445071051E+03  0.33445071051E+03  0.00000000000E+00 0.  1.
6c6
<          -6    1    1    2    0  502 -0.96800713603E+02  0.45396286052E+02  0.12624284574E+03  0.23936887233E+03  0.17300000000E+03 0.  1.
---
>          -6    1    1    2    0  504 -0.96800713603E+02  0.45396286052E+02  0.12624284574E+03  0.23936887233E+03  0.17300000000E+03 0.  1.
8c8
<          21    1    1    2  505  504 -0.56352823282E+01 -0.25774621670E+02 -0.66514459789E+01  0.27208992315E+02  0.00000000000E+00 0.  1.
---
>          21    1    1    2  505  501 -0.56352823282E+01 -0.25774621670E+02 -0.66514459789E+01  0.27208992315E+02  0.00000000000E+00 0.  1.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tmad_test (gg_ttgg.mad) finished with status=1 (NOT OK) at Thu Jun 27 15:00:19 UTC 2024
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tmad_test (gg_ttgg.mad) FPTYPE=d: issue will not be bypassed, test has FAILED


valassi commented 3 months ago

A tentative fix for this issue is in https://github.com/mg5amcnlo/mg5amcnlo/pull/116. This was initially meant to be merged into madgraph4gpu in PR #877, but this will not happen. An equivalent patch in cudacpp was also developed as an alternative, but it will not be enabled in #877 either. (The association with #877, which had been added at some point, has therefore been removed.)

This remains a major pending issue in my opinion.

valassi commented 3 months ago

Marking as reopened in the sense that it was NOT fixed in #877.

valassi commented 3 months ago

Marking as closed because it was fixed by Olivier in #880.

valassi commented 3 months ago

Unpinning the issue as this was finally fixed.