madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

master_june24 tmad tests crashes in dsig1_vec (CUDACPP_RUNTIME_VECSIZEUSED not correctly propagated?) #885

Closed by valassi 1 month ago

valassi commented 3 months ago

Another issue introduced in #830 and being reviewed in #882.

Related to the tests in WIP PR https://github.com/madgraph5/madgraph4gpu/pull/882:

master_june24 tmad tests crashes

https://github.com/madgraph5/madgraph4gpu/actions/runs/9795940531/job/27049356578#step:12:77

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7fef87623960 in ???
#1  0x7fef87622ac5 in ???
#2  0x7fef8724251f in ???
#3  0x55d65fd47542 in dsig1_vec_
#4  0x55d65fd485a9 in dsigproc_vec_
#5  0x55d65fd49352 in dsig_vec_
#6  0x55d65fd5ef5e in sample_full_
#7  0x55d65fd44d12 in MAIN__
#8  0x55d65fd1c6de in main
.github/workflows/testsuite_oneprocess.sh: line 289:  4276 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_f_inl0_hrd0/madevent_cpp < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum/input_ee_mumu_none > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum/output_ee_mumu_none' failed
 Renormalization scale set on event-by-event basis
 Factorization   scale set on event-by-event basis

I think this is the rotxxx issue; I will just cherry-pick those fixes to start with.

valassi commented 3 months ago

OK, I thought this looked like the rotxxx issue I know, but I have started to suspect that this is also a NEW bug introduced by master_june24.

The reason is that with the default vector size used (or explicitly 16384) this succeeds:

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
CUDACPP_RUNTIME_VECSIZEUSED=16384 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp

But with smaller values it crashes. I guess CUDACPP_RUNTIME_VECSIZEUSED is not correctly propagated or handled somewhere?

 CUDACPP_RUNTIME_VECSIZEUSED=32 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
...
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f3449023860 in ???
#1  0x7f3449022a05 in ???
#2  0x7f3448c54def in ???
#3  0x42d52a in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f:409
#4  0x42e589 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f:1059
#5  0x42f340 in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f:331
#6  0x4451ce in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/Source/dsample.f:229
#7  0x42ada8 in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f:257
#8  0x4026ce in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f:302
Floating point exception (core dumped)

Through gdb

CUDACPP_RUNTIME_VECSIZEUSED=32 gdb ./build.none_d_inl0_hrd0/madevent_cpp 
...
(gdb) run  < /tmp/avalassi/input_ggtt_x1_cudacpp
...
Program received signal SIGFPE, Arithmetic exception.
0x000000000042d52a in dsig1_vec (all_pp=<error reading variable: value requires 2097152 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, symconf=..., confsub=..., iconf_vec=..., 
    imirror_vec=..., vecsize_used=32) at auto_dsig1.f:409
409                 R=R-DABS(ALL_PD(IPSEL,IVEC))/ALL_PD(0,IVEC)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x000000000042d52a in dsig1_vec (all_pp=<error reading variable: value requires 2097152 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, symconf=..., confsub=..., iconf_vec=..., 
    imirror_vec=..., vecsize_used=32) at auto_dsig1.f:409
#1  0x000000000042e58a in dsigproc_vec (all_p=..., 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf_vec=..., iproc=1, imirror_vec=..., 
    symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=32) at auto_dsig.f:1059
#2  0x000000000042f341 in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf_vec=..., iproc=1, imirror_vec=..., 
    all_out=..., vecsize_used=32) at auto_dsig.f:331
#3  0x00000000004451cf in sample_full (ndim=4, ncall=8192, itmax=1, itmin=1, dsig=0x42f4a0 <dsig>, ninvar=4, nconfigs=1, vecsize_used=32)
    at dsample.f:229
#4  0x000000000042ada9 in driver () at driver.f:257
#5  0x00000000004026cf in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
#6  0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#7  0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#8  0x00000000004027f5 in _start ()
...
(gdb) p all_pd
value requires 262144 bytes, which is more than max-value-size
(gdb) p ivec
$1 = 33

Note the index IVEC=33, while we only use 32 events. This looks like the root cause of the issue.
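
To make the suspected failure mode concrete, here is a minimal standalone free-form Fortran sketch (illustration only, not the generated auto_dsig1.f code; ALL_PD, IPSEL, IVEC and VECSIZE_USED are names from the backtrace above, VECSIZE_MEMMAX is the compile-time maximum): if the IVEC loop runs up to the compile-time maximum while only VECSIZE_USED slots of ALL_PD have been filled, the division hits a zero-initialized slot at IVEC=33.

! Hypothetical sketch of the suspected bug, not the generated code:
! only the first VECSIZE_USED slots of ALL_PD are filled, but the loop
! runs up to the compile-time maximum, so at IVEC=33 it computes 0d0/0d0.
program vecsize_sketch
  implicit none
  integer, parameter :: vecsize_memmax = 16384
  integer :: vecsize_used, ivec, ipsel
  double precision :: all_pd(0:1, vecsize_memmax), r
  vecsize_used = 32
  ipsel = 1
  all_pd = 0d0                      ! unfilled event slots stay at zero
  all_pd(:, 1:vecsize_used) = 1d0   ! only the first VECSIZE_USED events are filled
  r = 0.5d0
  do ivec = 1, vecsize_memmax       ! bug: the loop bound should be vecsize_used
    r = r - dabs(all_pd(ipsel, ivec)) / all_pd(0, ivec)   ! 0d0/0d0 at ivec=33
  end do
  print *, r
end program vecsize_sketch

Built with 'gfortran -ffpe-trap=invalid sketch.f90' this aborts with SIGFPE at ivec=33 (without trapping it silently produces a NaN); I assume the madevent debug builds trap floating-point exceptions similarly, since the run above dies with SIGFPE.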

These are all from debug builds (adding -g) of gg_tt.mad.

oliviermattelaer commented 2 months ago

On the Fortran side, I have dropped all those "USED" concepts, since they were complicating the handling quite a lot (and it was complex enough already).

Now, I was not aware of that variable "CUDACPP_RUNTIME_VECSIZEUSED", and I am not aware that we set up such an environment variable for the code (did I miss this? a grep fails to find such a variable in either GPUCPP or the PLUGIN directory -- I did not check the full madgraph4gpu repo).

In any case, what value did you use on the Fortran side (for nb_warp and warp_size)? The code is compiled for those exact values, which act as an effective maximum: you can then decrease the value without a recompilation, but if you increase it the code does need recompilation (which is done automatically when running from the madgraph interface -- like a normal user).
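
To illustrate the pattern with a hedged sketch (made-up names, not the generated code): the compile-time parameter fixes the array storage, while the runtime value only bounds the loops, so lowering it is always safe but raising it beyond the compiled maximum would overrun the arrays and therefore needs a rebuild.

! Hedged illustration only (nb_events_max and nb_used are made-up names):
! storage is sized by the compile-time maximum, work is bounded by the
! runtime value, which therefore must never exceed the compiled maximum.
program compile_time_max_sketch
  implicit none
  integer, parameter :: nb_events_max = 16384   ! fixed at build time, sizes the arrays
  integer :: nb_used, i
  double precision :: all_wgt(nb_events_max)
  nb_used = 32                                  ! runtime choice
  if (nb_used > nb_events_max) stop 'runtime value exceeds compiled maximum: rebuild needed'
  all_wgt = 0d0
  do i = 1, nb_used                             ! only the loops use the runtime value
    all_wgt(i) = 1d0 / dble(i)
  end do
  print *, sum(all_wgt(1:nb_used))
end program compile_time_max_sketch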

So the question is how important this is for you.

Cheers,

Olivier

valassi commented 2 months ago

On the Fortran side, I have dropped all those "USED" concepts, since they were complicating the handling quite a lot (and it was complex enough already).

So the question is how important this is for you.

Hi Olivier, thanks.

For me this is now essential. All my tmad tests need this (i.e. the comparisons of cross sections and LHE files).

It is relatively trivial to implement this, and I am doing it now (this is one bullet in your #765). I will include this in #882.
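
As a rough sketch of what I mean (illustration only: this is not necessarily what the actual patch does, and apart from CUDACPP_RUNTIME_VECSIZEUSED and VECSIZE_MEMMAX the names are made up), the runtime value can be read from the environment and capped by the compile-time maximum:

! Hedged sketch, not the actual patch: read the runtime override from the
! environment, default to the compiled maximum, and refuse larger values
! because the arrays are dimensioned with VECSIZE_MEMMAX.
program read_vecsize_used_sketch
  implicit none
  integer, parameter :: vecsize_memmax = 16384
  integer :: vecsize_used, istat
  character(len=32) :: buf
  vecsize_used = vecsize_memmax
  call get_environment_variable('CUDACPP_RUNTIME_VECSIZEUSED', buf, status=istat)
  if (istat == 0) then
    read(buf, *, iostat=istat) vecsize_used
    if (istat /= 0 .or. vecsize_used < 1) vecsize_used = vecsize_memmax
  end if
  if (vecsize_used > vecsize_memmax) stop 'VECSIZEUSED exceeds the compiled VECSIZE_MEMMAX'
  print *, 'using vecsize_used =', vecsize_used
end program read_vecsize_used_sketch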

NB: I also dislike this. We discussed it in the past, and there are two ways I see to avoid it.

After we complete this master_june24 saga, I think that implementing the second option is doable, but not now. I will just implement nb_warp_used instead.

My use case and need is to allow VECSIZE_MEMMAX=16384 (or another large number) for CUDA to test the maximum speed, but then be able to use the same builds for short fast tests (where speed is not important, but functionality is). In particular, if I run the C++ tests with 16384, some may take weeks...

(Rephrase: for me testing is the most important part of development - I need this for testing)

Thanks Andrea

valassi commented 2 months ago

This is now fixed here https://github.com/mg5amcnlo/mg5amcnlo/compare/6c7fda883a5d431bd402278f6b66693d93eb04ab...2f4c46765cbc1a71e8d0d585288e123a2dc670bc#diff-dde69d2cab5a5b3aacdbba8584f63325ee525e3af44bc8c8c74db256d0e32915

At some point I will make an MR on gpucpp (this is in a private branch, valassi_gpucpp_june24).

The CI confirms that the crashes have disappeared https://github.com/madgraph5/madgraph4gpu/pull/882#issuecomment-2233805338

Closing

oliviermattelaer commented 2 months ago

Commenting on the use of nb_warp_used.

This also requires some changes in the upstream madgraph (via banner.py), in order to have real support also on the Fortran side and the correct logic to recompile the code when vecsize_used is increased but not when it is decreased...

So please make a separate PR for that specific point, since that commit is in itself problematic. Or at least push that branch somewhere where I can fix the stuff in the way I would like.

Concerning dynamic allocation, zenny now has experience on how to do that, so this is also something that I/he can do.

valassi commented 2 months ago

Hi Olivier, thanks for the comments :-)

Listen, can we also split this into two parts? I mean, open a new issue about 'adding nb_warp_used support in banner.py etc'?

This specific issue about the crashes (SIGFPE) is solved. So maybe my patch is not a complete patch, but it is a partial patch that works (and allows me to go forward on things where I would otherwise be blocked). I completely agree that there is more work to do (there is always more work to do), but can we already include this patch?

Anyway, let's discuss tomorrow. Thanks! Andrea

oliviermattelaer commented 2 months ago

Please include that set of patches in a dedicated branch within the madgraph repo/fork; I will adapt my handling accordingly and then we can merge this.

This way you will not be blocked and I will not be blocked either

Cheers,

Olivier


valassi commented 2 months ago

Please include that set of patches in a dedicated branch within the madgraph repo/fork; I will adapt my handling accordingly and then we can merge this. This way you will not be blocked and I will not be blocked either.

Thanks Olivier :-)

Two comments.

First, about my NB_WARP_USED patches: I created two PRs from my dedicated branch, one against gpucpp_june24 (https://github.com/mg5amcnlo/mg5amcnlo/pull/120) and one against gpucpp (https://github.com/mg5amcnlo/mg5amcnlo/pull/121). The former, however, is blocked by #886, i.e. gpucpp_june24 is unusable with master_june24 (and this is independent of my later work). Could you maybe look at #886, please? Otherwise, PR 121 is good for me, but I would merge it only when we merge #882 into master, hence it is marked WIP for now.

Second, I opened #928 about adding nb_warp_used in banner.py, or whatever else still needs to be done. Can we agree to close this one, #885 (the crash is fixed), and keep the discussion in #928 after merging master_june24 and #882 into master?

Thanks Andrea

oliviermattelaer commented 1 month ago

OK, I was trying to understand the state of the _used stuff and get back here... But yes, let's close this thread and (re)discuss the _used stuff somewhere else (#928 maybe, or one of your PRs).

valassi commented 1 month ago

OK, I was trying to understand the state of the _used stuff and get back here... But yes, let's close this thread and (re)discuss the _used stuff somewhere else (#928 maybe, or one of your PRs).

Thanks @oliviermattelaer! OK, so let's consider this closed; I am working towards merging #882.