reflectometry / refl1d

1-D reflectometry fitting
https://refl1d.readthedocs.io/

restructure magnetism calculation so matrix product is the same for front and back reflectivity #133

Open pkienzle opened 3 years ago

pkienzle commented 3 years ago

The Cr4xa code has a switch which changes the optical matrix calculation from the forward direction to the reverse direction when kz < 0. If possible, restructure the equation so that the matrix calculation is always in the forward direction. This should simplify the code and allow for easier data parallelism in the calculation, with the layer parameters shared for all kz in the same block.
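One way to avoid the reverse-direction branch entirely is to flip the stack instead of the matrix product. This is a sketch of the idea only, not the Cr4xa implementation; the function name and array layout are illustrative.

```python
import numpy as np

def orient_for_forward_calc(kz, rho, irho, rhom, thickness):
    """Illustrative helper: for kz < 0, reverse the layer stack and flip the
    sign of kz so the optical-matrix recursion can always run front-to-back.
    All arrays are ordered fronting-to-backing on input."""
    if kz >= 0:
        return kz, rho, irho, rhom, thickness
    # Reversed stack: the backing medium becomes the incident medium.
    return -kz, rho[::-1], irho[::-1], rhom[::-1], thickness[::-1]
```

With this in place the matrix calculation itself never needs a direction switch; the cost moves to re-deriving quantities that depend on the incident medium, which is the concern raised below.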

bmaranville commented 3 years ago

This seems like it will be straightforward... though reversing time always comes with risks.

bmaranville commented 3 years ago

It seems like the results for the reversed direction are [equation rendered as an image in the original comment], where `detw == (B44*B22 - B42*B24)` as before. It turns out to be the same as the forward direction with the indices reversed. I can't guarantee that the SF cross sections won't be switched.

bmaranville commented 3 years ago

It's more complicated than this, of course: there are assumptions baked into the function, such as certain elements being zero in the first layer (which is now sometimes the last layer). Also, I think we need to reverse the U1 and U3 matrices, since they are calculated in the forward direction (aligned with the layers).

bmaranville commented 3 years ago

The fact that we calculate the wavefunctions in the layers relative to E0 = KZ*KZ + PI4*(RHO[fronting]+IP*RHOM[fronting]) makes me wonder whether this effort gains anything: we have to recalculate the whole stack whenever the SLD of the incident medium changes, so where is the saving? Unless the fronting and backing media have exactly the same SLD and magnetic SLD, this doesn't help at all.
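To make the dependence concrete, here is a minimal numeric check (names and values are illustrative, not taken from the Cr4xa source): E0 is computed from the incident-medium SLD, so reversing the stack changes E0 whenever the fronting and backing SLDs differ.

```python
import math

PI4 = 4e-6 * math.pi  # 4*pi with the 1e-6 SLD scale; illustrative constant

def E0(kz, rho_incident, rhom_incident, IP=+1):
    # Energy offset relative to the incident medium, as in the comment above.
    return kz * kz + PI4 * (rho_incident + IP * rhom_incident)

KZ = 0.01
rho_front, rhom_front = 0.0, 0.0    # e.g. air fronting (illustrative)
rho_back, rhom_back = 2.07, 0.0     # e.g. Si backing (illustrative)

forward = E0(KZ, rho_front, rhom_front)
reversed_ = E0(KZ, rho_back, rhom_back)
# The two differ unless fronting and backing have identical (magnetic) SLDs,
# so the whole stack must be recomputed for the reversed direction.
```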

bmaranville commented 3 years ago

@pkienzle can we close this one? I think it's a dead end.

pkienzle commented 3 years ago

SIMD: if neighbouring kz values are using the same layer values then you can prefetch them in parallel then apply them across every kz in the warp, with different E0 for each kz. And the indexing becomes simpler which makes it easier for the optimizer to recognize the patterns and perhaps do some automatic loop unrolling.

A quick experiment (replacing the indexing with simple indexing on the existing code) didn't make a noticeable difference in speed, so it's not worth spending much time on right now. It may help a fast GPU implementation, but we aren't there yet.
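The kind of data parallelism described above can be sketched with numpy broadcasting (a toy model of the structure, not the refl1d kernel): the outer loop walks layers, so each layer's parameters are fetched once and applied across a whole block of kz values, with a different E0 per kz.

```python
import numpy as np

def s_values(kz_block, rho, irho, eps=1e-10):
    """Toy sketch of a SIMD-friendly layout: layer parameters are shared
    across every kz in the block, while E0 varies per kz."""
    PI4 = np.pi * 4e-6
    kz_block = np.asarray(kz_block)
    # Incident-medium offset, computed for all kz at once.
    e0 = kz_block**2 + PI4 * rho[0]
    S = np.empty((len(rho), kz_block.size), dtype=complex)
    for L in range(len(rho)):  # one parameter fetch per layer, not per (layer, kz)
        S[L] = -np.sqrt(PI4 * rho[L] - e0 - 1j * PI4 * (abs(irho[L]) + eps))
    return S
```

In a GPU version the kz block would map onto a warp, with the layer parameters held in shared memory or registers.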

Sometimes shorter code can help the compiler and processor. I saved a little time (5%) by simplifying the B initialization, using the following:

    from math import fabs
    from cmath import sqrt  # complex square root

    # PI4, EPS, E0, RHO, IRHO, RHOM, L (fronting layer index), and N (layer
    # count) are defined earlier in the Cr4xa code.
    Z = 0.
    # Wavevectors in the fronting medium for the two spin states.
    S1LP = -sqrt(complex(PI4*(RHO[L]+RHOM[L]) - E0, -PI4*(fabs(IRHO[L])+EPS)))
    S3LP = -sqrt(complex(PI4*(RHO[L]-RHOM[L]) - E0, -PI4*(fabs(IRHO[L])+EPS)))
    BLP = GLP = 0j
    # B starts as the 4x4 identity; all off-diagonal elements are zero.
    B11 = B22 = B33 = B44 = 1. + 0j
    B12 = B13 = B14 = B21 = B23 = B24 = B31 = B32 = B34 = B41 = B42 = B43 = 0j
    for I in range(0, N-1):
        ...

Note: I didn't check that the result was correct; I only wanted to see if it would be faster.

This may be worth doing to make the code easier to maintain. Fewer repeated expressions. I'll let you decide.

pkienzle commented 3 years ago

The second comment belongs with the PR. I'll move it.

bmaranville commented 3 years ago

Ah, I see. So would it be fair to say this is more about speeding up the calculation in the forward direction than about being clever and reusing the same matrix for the back direction without recalculating? Naively, there seems to be much more to gain by optimizing one direction and simply reversing the SLD stack when "back reflectivity" is requested, which I have to imagine is not all that often. (The maximum speedup from calculating back and forward at the same time is a factor of two, and with all the changes we would have to make to support it, the real gain would be much less than that.)

pkienzle commented 3 years ago

Yes.

If I understand correctly, code and registers share local memory in a GPU warp, so increasing code size reduces the space available for registers, sometimes to the point where ALUs sit idle. (Old info; I didn't confirm that this is still true, if it ever was.) So keep the code light and simple.

If it is not easy to munge the forward calculation to handle -kz, then it makes sense to reverse the stack and call it with +kz. In the extremely rare case* where we cross from -kz to +kz in the same dataset we can split it and calculate them separately. IIRC that's the logic I have in abeles.py, which uses numpy vectorization for fast calculation of non-polarized reflectivity.
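Splitting a mixed-sign kz dataset and reversing the stack for the negative half might look like the following sketch (function names are illustrative and the forward solver is left abstract; this is not the abeles.py code):

```python
import numpy as np

def reflectivity_any_sign(kz, stack, forward_refl):
    """Sketch: evaluate a forward-only solver on data that mixes kz signs.
    `stack` is a fronting-to-backing array of layer parameters, and
    `forward_refl(kz, stack)` only accepts kz >= 0."""
    kz = np.asarray(kz)
    r = np.empty(kz.shape, dtype=complex)
    pos = kz >= 0
    if pos.any():
        r[pos] = forward_refl(kz[pos], stack)
    if (~pos).any():
        # Back reflectivity: flip the stack and the sign of kz.
        r[~pos] = forward_refl(-kz[~pos], stack[::-1])
    return r
```

The two solver calls are independent, so each one sees a monotone, single-sign kz block with a fixed stack orientation.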

Or require kz > 0 always. In reduction, add a "back reflectivity" tag to the header for -kz data. Create two separate datasets if the scan sweeps between negative and positive kz and handle them with simultaneous fitting.

[*] The resolution function should never mix signs of kz. If it did, we would also have to model the transmitted beam and/or the direct beam that misses the sample for those data points.