coding style for helas routine

oliviermattelaer commented 4 years ago

Hi Andrea,

You have modifed the ixxxxxx routine in a way that puzzle me. Could you explain me the interest of adding such lines? cxtype& fi_0 = fis[0]; cxtype& fi_1 = fis[1]; cxtype& fi_2 = fis[2]; cxtype& fi_3 = fis[3]; cxtype& fi_4 = fis[4]; cxtype& fi_5 = fis[5];

Does this speed up the computation? ( I can guess that it does) Does it decrease register? (I guess that this is bad for register no?)

valassi commented 4 years ago

Hi Olivier, yes I guess this may be confusing :-)

The reason why I added this was to be able to test different options in parallel, eg GLOBAL, LOCAL, SHARED. Then however I removed the other options and we only use LOCAL, so now it may be less clear why it is there. See this diff: https://github.com/madgraph5/madgraph4gpu/commit/d1a5097318e80b827d36bc69b12ba8033acb5024#diff-a46ac474dd35c2c37c98db17c70afb67424f007253c0945b35a5cd36eac4ae3b

So, no, it does not speed up computation or reduce registers. It was just a way to support different data layouts in parallel.

Now that we are left with a single LOCAL option, in principle it can be removed (I should cross check everywhere). I do find it quite neat to separate two steps, first how you find the relevant variable from your memory layout, and second what you do with those variables in the computation. This means that if one day we change memory layout, but not computation, we only hav eto change the first part of the code, where the fi_0123 are defined. But maybe you find it confusing actually - and in any case it is probably not necessary.

Similarly, note that the pIparIp4Ievt function now is an overkill. It used to make sense with multiple AOS/SOA formats, but this has also changed and been simplified because we have a single AOSOA option, see this other diff: https://github.com/madgraph5/madgraph4gpu/commit/5be87d07afbd518ecb5c5744e2d567d9fd6719a9#diff-a46ac474dd35c2c37c98db17c70afb67424f007253c0945b35a5cd36eac4ae3b But again, even if we can now remove the pIparIp4Ievt function, I would probably keep in each FVV function the lines defining pvec0123 (and just hardcode there the pIparIp4Ievt implementation, rather than use a function).

About your question whether this actually causes a performance overhead, I think that it does not. And I think that I did try to have a look - but I should test this explicitly again. These are all references, or even constant references. Maybe I do not understand this completely, but I would imagine that the memory overhead is some memory addresses (managed in the executable/library), not some memory contents (stored in registers - and spilled out to global memory if registers are not enough).

So, my summary would be

This was necessary to test AOS and GLOBAL options. Now it is no longer necessary but I would keep it. (But of course if you prefer to remove that, or it's easier, just go ahead!).
I would still test explicitly that it does not cause a performance overhead in terms of registers, but I think it does not
We could remove the pIparIp4Ievt function and inline that in the definition of the pvec0123

Does this clarify? And what do you think of the options? Thanks Andrea

valassi commented 4 years ago

PS Lets also put it this way: I find it highly unlikely, but it is not completely excluded, that at some point we experiment again with GLOBAL and SHARED. For instance when we test streams or graphs again? In that case, separating the memory access to the variable from the calculation involving the variable seems like a good idea to me, as long as it has no performance overhead.

oliviermattelaer commented 4 years ago

Thanks Andrea this makes more sense.

Then I will move forward first with a version where I do not focus on that specifically and pick the simplest solution for me (maybe not even fully consistent within the full code). Then when we will have a working solution, we can look back at this and clean in one direction or the other.

valassi commented 4 years ago

I removed this from sampling (not relevant there). Added this to vectorisatrion (I will explain why it is relevant tomorrow). Actually this is needed in a variety of contexts, for instance also to provide C++/CUDA single source (which does not need to be a project, but is an overarching theme)

valassi commented 3 years ago

Hi @oliviermattelaer I am reviewing old tickets and I will close this one.

The peculiar style of coding that you asked about an dthat I had initially implemented has now disappeared.

The current code, for instance ggttgg epochX4, is much simpler, examples

Closing

madgraph5 / madgraph4gpu

coding style for helas routine #57