madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

BSM UFO models: ME mismatch between HRDCOD=0 and HRDCOD=1 in EWdim6 u d~ to w+ z #846

Open valassi opened 1 month ago

valassi commented 1 month ago

Hi @zeniheisser, as discussed I have checked that my recent SUSY/HEFT/SMEFT patches make it possible to generate code for EWdim6 u d~ to w+ z, thus fixing what you reported in #615.

The code also builds and runs. However, I found a mismatch in the calculation of the ME between HRDCOD=0 and HRDCOD=1 (within runTest.exe in my "tput" test suite). If I use a reference file created with HRDCOD=0, then the test fails for HRDCOD=1, and I guess vice versa. I am not sure which one is correct. I will later also compare to Fortran.

The error is for instance the following:

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/ewdim6_ud_wz.mad/SubProcesses/P1_udx_wpz> ./build.none_d_inl0_hrd1/runTest.exe
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
INFO: The application does not require the host to support any AVX feature
[==========] Running 6 tests from 6 test suites.
[----------] Global test environment set-up.
...
[ RUN      ] SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_EWdim6_udx_wpz.txt
MadgraphTest.h:311: Failure
The difference between testDriver->getMatrixElement( ievt ) and referenceData[iiter].MEs[ievt] is 17709.875596685633, which exceeds toleranceMEs * referenceData[iiter].MEs[ievt], where
testDriver->getMatrixElement( ievt ) evaluates to 17888.767727190199,
referenceData[iiter].MEs[ievt] evaluates to 178.89213050456669, and
toleranceMEs * referenceData[iiter].MEs[ievt] evaluates to 0.00017889213050456667.
Google Test trace:
MadgraphTest.h:289: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.500000000000000e+02  5.849331413473451e+02 -3.138365726669762e+02 -3.490842674916368e+02
ref2  7.500000000000001e+02  5.849331413473452e+02 -3.138365726669762e+02 -3.490842674916370e+02

   3  7.500000000000002e+02 -5.849331413473450e+02  3.138365726669762e+02  3.490842674916368e+02
ref3  7.499999999999999e+02 -5.849331413473452e+02  3.138365726669762e+02  3.490842674916369e+02

  ME  1.788876772719020e+04
r.ME  1.788921305045667e+02

INFO: No Floating Point Exceptions have been reported
[  FAILED  ] SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x129bc70 (8 ms)
[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest (8 ms total)

The same failure happens for FPTYPE=d,f,m.
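For reference, the failing check can be reproduced from the numbers printed in the log above. This is only a minimal sketch, assuming the comparison is a plain relative tolerance as the failure message suggests; the helper name `me_within_tolerance` is hypothetical and is not taken from MadgraphTest.h.

```python
def me_within_tolerance(me, ref, tolerance):
    """Hypothetical helper: True if |me - ref| <= tolerance * |ref|."""
    return abs(me - ref) <= tolerance * abs(ref)


# Values copied from the failure message in the log above.
me = 17888.767727190199          # testDriver->getMatrixElement( ievt )
ref = 178.89213050456669         # referenceData[iiter].MEs[ievt]
allowed = 0.00017889213050456667 # toleranceMEs * referenceData[iiter].MEs[ievt]

# The relative tolerance is not printed directly, so infer it from the log.
tolerance_mes = allowed / ref

difference = abs(me - ref)
ratio = me / ref

print(f"inferred toleranceMEs = {tolerance_mes:.3e}")
print(f"|ME - ref|            = {difference:.6f}")
print(f"ME / ref              = {ratio:.6f}")
print(f"within tolerance?       {me_within_tolerance(me, ref, tolerance_mes)}")
```

With these inputs the inferred tolerance is about 1e-6, and the computed ME is close to, but not exactly, 100 times the reference value. Whether that factor is a meaningful hint or a coincidence is not established here.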

valassi commented 1 month ago

Note1: the tmad tests comparing to Fortran succeed (for HRDCOD=0). So I would guess that it is HRDCOD=1 that gives wrong results?

Note2: however, in PR #847, the CI actually fails the test for HRDCOD=0 (which should normally succeed??). So there is clearly something to be understood here.

https://github.com/madgraph5/madgraph4gpu/actions/runs/9114905159/job/25059969417?pr=847

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tput_test (ewdim6_ud_wz.sa) starting at Thu May 16 15:21:05 UTC 2024
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Current directory is /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/ewdim6_ud_wz.sa

*******************************************************************************
*** tput-test ewdim6_ud_wz.sa (P1_Sigma_EWdim6_udx_wpz)
*******************************************************************************

Testing in /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/ewdim6_ud_wz.sa/SubProcesses/P1_Sigma_EWdim6_udx_wpz

Execute build.none_d_inl0_hrd0/runTest.exe
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
INFO: The application does not require the host to support any AVX feature
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU_XXX
[ RUN      ] SIGMA_EWDIM6_UDX_WPZ_CPU_XXX.testxxx
[       OK ] SIGMA_EWDIM6_UDX_WPZ_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU_MISC
[ RUN      ] SIGMA_EWDIM6_UDX_WPZ_CPU_MISC.testmisc
[       OK ] SIGMA_EWDIM6_UDX_WPZ_CPU_MISC.testmisc (5 ms)
[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU_MISC (5 ms total)

[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest
[ RUN      ] SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_EWdim6_udx_wpz.txt
MadgraphTest.h:311: Failure
The difference between testDriver->getMatrixElement( ievt ) and referenceData[iiter].MEs[ievt] is 17709.875596685673, which exceeds toleranceMEs * referenceData[iiter].MEs[ievt], where
testDriver->getMatrixElement( ievt ) evaluates to 17888.767727190239,
referenceData[iiter].MEs[ievt] evaluates to 178.89213050456669, and
toleranceMEs * referenceData[iiter].MEs[ievt] evaluates to 0.00017889213050456667.
Google Test trace:
MadgraphTest.h:289: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.500000000000000e+02  5.849331413473451e+02 -3.138365726669762e+02 -3.490842674916368e+02
ref2  7.500000000000001e+02  5.849331413473452e+02 -3.138365726669762e+02 -3.490842674916370e+02

   3  7.500000000000002e+02 -5.849331413473450e+02  3.138365726669762e+02  3.490842674916368e+02
ref3  7.499999999999999e+02 -5.849331413473452e+02  3.138365726669762e+02  3.490842674916369e+02

  ME  1.788876772719024e+04
r.ME  1.788921305045667e+02

INFO: No Floating Point Exceptions have been reported
[  FAILED  ] SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x56001c55a510 (7 ms)
[----------] 1 test from SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest (7 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (13 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SIGMA_EWDIM6_UDX_WPZ_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x56001c55a510

 1 FAILED TEST

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tput_test (ewdim6_ud_wz.sa) finished with status=1 (NOT OK) at Thu May 16 15:21:05 UTC 2024
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Error: Process completed with exit code 1.
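
As a quick cross-check, the ME values printed in the two logs in this thread (the local HRDCOD=1 run and the CI HRDCOD=0 run) can be compared with each other instead of with the reference file. This is a sketch using only numbers copied from the logs; the variable names are illustrative, and the 1e-6 tolerance is the value implied by the failure message, not one read from the test source.

```python
# "ME" and "r.ME" lines copied from the two logs in this thread.
me_hrd1_local = 1.788876772719020e+04  # build.none_d_inl0_hrd1 (local run)
me_hrd0_ci = 1.788876772719024e+04     # build.none_d_inl0_hrd0 (CI run)
ref_me = 1.788921305045667e+02         # r.ME, identical in both logs

# Tolerance implied by the failure message (an assumption, see above).
tolerance = 1e-6

# Relative difference between the two computed MEs.
rel_diff_builds = abs(me_hrd1_local - me_hrd0_ci) / abs(me_hrd0_ci)

# Relative difference of each computed ME with respect to the reference.
rel_diff_ref_hrd1 = abs(me_hrd1_local - ref_me) / abs(ref_me)
rel_diff_ref_hrd0 = abs(me_hrd0_ci - ref_me) / abs(ref_me)

print(f"hrd1 vs hrd0: {rel_diff_builds:.3e}")
print(f"hrd1 vs ref : {rel_diff_ref_hrd1:.3e}")
print(f"hrd0 vs ref : {rel_diff_ref_hrd0:.3e}")
print(f"builds agree within tolerance? {rel_diff_builds <= tolerance}")
```

In these two particular logs the computed MEs agree with each other far below the tolerance, while both differ from the reference value by the same large amount. This does not by itself say which value is correct, and the two runs are from different directories (.mad locally, .sa in the CI).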

valassi commented 1 month ago

And, even more bizarre, the .mad tests succeed in the same CI; only the .sa tests fail. (I think this should normally be the same code?) So again, something to be debugged...