WIP: merge of master into master_june24 (for the moment: add master CI to master_june24, identify/fix some issues)

madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

WIP: merge of master into master_june24 (for the moment: add master CI to master_june24, identify/fix some issues) #882

Open valassi opened 6 days ago

valassi commented 6 days ago

Hi @oliviermattelaer, this is a WIP PR to start working towards the resync of master and master_june24. From what I understand, this is one of the things you want to push with high priority.

This is constructed as a merge into master_june24. That is to say, I start from what you and Stefan had in master_june24 (as a result of Stefan's channelid PR #830, related to the warp issue #765), and I start porting some of the changes from master, rather than going the opposite way. This allows me to proceed in steps with things I know (the various steps in master).

For the moment, here I am just merging the latest master CI (with tmad tests) into master_june24. Since the CI is enabled also for master_june24, I expect that the new tests should run, and the results may be interesting.

Speaking of which, @roiser @oliviermattelaer, how did you test the code in master_june24?

Am I supposed to use a different input.txt file to pipe to madevent to specify a range of iconfig values, or will the current one with a single iconfig value be enough?
If I am supposed to use the same input.txt with a single iconfig (looking at driver.f, which has not changed, I would guess this is the case), can you confirm that the code will still test the new functionality you have created and use a channelid array with different values? Or will this result in a channelid array whose entries all have the same value?
(@oliviermattelaer, for my information, not directly or immediately relevant for tests: is the madevent fortran/python/bash infrastructure to orchestrate fewer G* jobs with many channels per job complete, or is this still under development?)
(And also for my information, in case I have issues in the code: do I remember correctly that a channelid array of e.g. 32k channels will be segmented such that inside each 32-channel warp the channelid is the same, but different warps can have different channelids? Or did you eventually modify this logic?)
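
To make the layout asked about in the last question concrete, here is a minimal Python sketch. It is not code from the repository: the helper name `channel_ids_per_event`, the example channel values and the warp size of 32 are assumptions for illustration only. It expands one channelid per warp into a per-event array and checks that each warp-sized slice is uniform while different warps may differ.

```python
def channel_ids_per_event(warp_channels, warp_size=32):
    """Expand one channel id per warp into one channel id per event.

    Hypothetical helper: it only illustrates the layout described in the
    question above, not the actual madevent implementation.
    """
    return [ch for ch in warp_channels for _ in range(warp_size)]


# Made-up example: 4 warps of 32 events, each warp with its own channel id.
ids = channel_ids_per_event([1, 2, 2, 5])

# Split the per-event array back into warp-sized slices.
warps = [ids[i:i + 32] for i in range(0, len(ids), 32)]

# Within a warp the channel id is constant; across warps it may change.
uniform_within_warps = all(len(set(w)) == 1 for w in warps)
distinct_across_warps = len({w[0] for w in warps})

print(len(ids), uniform_within_warps, distinct_across_warps)
```

If the real code instead filled the whole array with a single value, `distinct_across_warps` would be 1, which is the case the second question is worried about.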

Thanks, Andrea

PS For context: master_june24 mainly differs from master because of the addition of Stefan's channelid PR #830, which is connected to Olivier's warp work in #765.

valassi commented 6 days ago

There are 49 errors in the CI. I opened #883, #884 and #885.

valassi commented 5 days ago

I am trying to fix issues in MG5AMC. I will do a force push and file an issue.

valassi commented 5 days ago

I have tried to upgrade MG5AMC from the current eef200f94 to gpucpp_june24, but this fails codegen (#886). I have reverted.

I will instead create a branch where I merge gpucpp on top of the eef200f94 which is currently in master_june24.

valassi commented 5 days ago

This is annoying. I upgraded MG5AMC including the rotxxx fix #857 that I used for the crash #855. This has NOT fixed the CI crash of madevent in all CI tests #885. I will need to use a debug build with gdb.

There are still 49 failing tests.

valassi commented 5 days ago

I have fixed a minor typo in unit_v for MAC #883. Marking it as fixed.

Now there are only 45 CI errors instead of 49.

valassi commented 5 days ago

I have fixed another minor issue, #884, which was failing the builds for FPTYPE=m. One line was left over from a previous implementation: it should have been removed for FPTYPE=m and was not.

Now down from 45 to 39 errors, all related to #885 crashes in the new CI tests I think. The old CI tests are now all succeeding.

valassi commented 5 days ago

I investigated #885 and found that the crash only happens when VECSIZE_USED is set to a value different from VECSIZE_MEMMAX. In my initial CI tests VECSIZE_MEMMAX was 16384 and VECSIZE_USED was 32, so this crashed.

Looking more into that, I realised that I was not sure what parameters I should use for NB_WARP and WARP_SIZE. This is discussed in #887. I tried using NB_WARP=512 with WARP_SIZE=32, i.e. VECSIZE_MEMMAX=16384. With VECSIZE_USED=32, this still crashes in #885. But in addition I also get a Fortran runtime error in symconf #888.

valassi commented 5 days ago

I added a workaround (NOT a fix) for crash #885 just to allow the CI tests to proceed further. Essentially I set NB_WARP=1 and WARP_SIZE=32 so that VECSIZE_MEMMAX=32 is the same as VECSIZE_USED=32. This avoids the crash (but it also avoids testing anything interesting in the new warp infrastructure, making it pointless). A proper fix for #885 (and for #888) is needed.
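
The arithmetic behind the two configurations can be sketched in a few lines of Python. This is only an illustration: the relation VECSIZE_MEMMAX = NB_WARP * WARP_SIZE is inferred from the numbers quoted in this thread (512 * 32 = 16384 and 1 * 32 = 32), and the function names are hypothetical, not part of the codebase.

```python
def vecsize_memmax(nb_warp, warp_size):
    # Assumption inferred from the thread: memory is sized for all warps.
    return nb_warp * warp_size


def hits_observed_crash_condition(nb_warp, warp_size, vecsize_used):
    # The crash in #885 was only observed when the two sizes differ.
    return vecsize_used != vecsize_memmax(nb_warp, warp_size)


# (NB_WARP, WARP_SIZE, VECSIZE_USED) for the two setups discussed above.
initial = (512, 32, 32)
workaround = (1, 32, 32)

for name, cfg in (("initial", initial), ("workaround", workaround)):
    memmax = vecsize_memmax(cfg[0], cfg[1])
    print(name, memmax, hits_observed_crash_condition(*cfg))
```

The workaround configuration sidesteps the mismatch only by reducing the run to a single warp, which is why it does not exercise the multi-warp code path.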

Apart from other 'expected' failures, there is an xsec mismatch for ggttggg #889. This can be fixed by increasing the tolerance.

There are ten failures overall. The other 9 are the usual #826 and #872 (pending in master) and #856 (fixed in master, to be merged here).
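
As an illustration of what "increasing the tolerance" means for a cross-section comparison, here is a minimal Python sketch using the standard library's `math.isclose`. The cross-section values and the two tolerances are made up for the example, and `xsec_agree` is a hypothetical helper; the actual comparison and thresholds used by the tmad tests are not shown in this thread.

```python
import math


def xsec_agree(xsec_a, xsec_b, rel_tol):
    # Relative comparison: the allowed difference scales with the larger value.
    return math.isclose(xsec_a, xsec_b, rel_tol=rel_tol)


# Made-up cross sections differing by about 0.2%.
xsec_fortran, xsec_cpp = 1.000, 1.002

strict = xsec_agree(xsec_fortran, xsec_cpp, rel_tol=1e-3)  # tighter tolerance
loose = xsec_agree(xsec_fortran, xsec_cpp, rel_tol=5e-3)   # relaxed tolerance

print(strict, loose)
```

Relaxing the tolerance makes the test pass but does not explain the difference, so it is better seen as a workaround than as a fix.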

valassi commented 5 days ago

Ok, I added a workaround for the tolerances #889, now down to 9 errors.

The crash #885 becomes high priority here, otherwise we are not testing anything interesting, only NB_WARP=1...

valassi commented 4 days ago

Hi @oliviermattelaer, as you see I made some progress here, but I am putting this work on hold.

I need some answers on #887 (and it is possible that without VECSIZE_USED I go nowhere and I need to wait for that).

(Or at least: probably I will continue merging bits of master into master_june24 so that we avoid a complete divergence, but I would argue against merging back master_june24 into master until many of these issues are fixed... there are just too many things that seem to not have been tested)