madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package
28 stars 33 forks source link

Fix intermittent FPE crashes for SIMD (add volatile) and allow no-OpenMP builds #874

Closed valassi closed 6 days ago

valassi commented 1 week ago

This is a WIP PR to allow no-OpenMP builds (#758) and to debug intermittent FPEs (#845).

This is starting from MR #873 which fixes color mappings and related issues.

The reason why OMP and FPEs are related is that I had some indication that the FPE crash is intermittent with OMP, but happens all the time without OMP. So disabling OMP might trigger the issue more often. See https://github.com/madgraph5/madgraph4gpu/pull/873/commits/f1e0d42349a97574647cd09b088b2092cfe4ee2b

valassi commented 1 week ago

Note: even after disabling OMP in the CI tests, issue #845 has not shown up. The only 6 usual errors are from #826 and #872.

I think that I understand this better now: this issue ONLY happens for AVX512 (and specifically 512z with FPTYPE=f). Our CI on standard github nodes does not offer AVX512 (and it would be an overkill to set up our own nodes just for that). So I will debug this in manual tests.

valassi commented 1 week ago

Hi @oliviermattelaer yet another one (we are getting there...).

I have fixed this intermittent crash #845 again by using 'volatile'. This time it should be less controversial I hope. The crash happens deep in SIMD code and only in specific configurations like 512z and fptype=f. So I am quite confident that disabling optimizations with volatile makes sense. Note this interesting post on SIMD and float, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

En passant I added the option (that you had asked for in #758) to disable OpenMP. You just need to export OMPFLAGS= (set an empty string: this is different from unsetting it).

Can you please review? NB This is based on #873, so I would FIRST review and merge that, and THEN review this (so it becomes easier to see only the additional differences).

Thanks! Andrea

valassi commented 1 week ago

I have rerun the manual tests. This is now ready to be merged (but it would be best to first agree on and merge https://github.com/mg5amcnlo/mg5amcnlo/pull/115 and #873)

valassi commented 1 week ago

Putting this back to draft for the moment until we agree on #873 and #877

valassi commented 1 week ago

I have merged the latest upstream/master (including PR #852 and PR #877) into this and marked it again as ready for review.

valassi commented 6 days ago

Hi @oliviermattelaer I have updated this again with upstream/master, it is ready to be merged

Can you please review? (when you have time, not urgent)

Thanks Andrea

oliviermattelaer commented 6 days ago

Perfect to be merged (but I would like to keep #758 open, since I think that we should have an agreement on how we handle OpenMP by default from the madgraph5. (i.e. should I set OMPFLAGS= by default or not, should we have an entry in the run_card to control that /...) -> will copy-paste this message in #758 (since this is the proper place to discuss that).

Thanks a lot,

Olivier

valassi commented 6 days ago

Very good, thanks Olivier!

I will merge this then, and reopen #758

Note, there are 6 errors in the CI but they are the usual failing ones image