Closed johnomotani closed 4 weeks ago
Wow, a 3x speed-up would be impressive!

What I don't understand, and am wondering about, is how you knew that the "for loop" version
```julia
function lagrange_poly(j, x_nodes, x)
    # get number of nodes
    n = size(x_nodes, 1)
    # location where l(x0) = 1
    x0 = x_nodes[j]
    # evaluate polynomial
    poly = 1.0
    for i in 1:j-1
        poly *= (x - x_nodes[i]) / (x0 - x_nodes[i])
    end
    for i in j+1:n
        poly *= (x - x_nodes[i]) / (x0 - x_nodes[i])
    end
    return poly
end
```
would be slower than the version using Julia base functions
```julia
function lagrange_poly_optimised(other_nodes, one_over_denominator, x)
    return prod(x - n for n ∈ other_nodes) * one_over_denominator
end
```
In my previous life as a Fortran developer, I would have expected a negligible difference between the two methods in a compiled language -- what am I missing?
Some timing checks would be helpful here to justify the changes (from the comments it looks like you are planning to do this anyway). If this method were generalised to an N-dimensional case, I would be concerned that the memory requirement for storing the "other nodes" for every node might become an issue.
@mrhardman I don't think it was other changes that made the speedup. I didn't test separately, but the significant things are:
I did also try a for-loop version of `lagrange_poly_optimised()`, which had essentially identical timing to the one using `prod`. `prod` was marginally faster (although the difference could have just been noise), and is more compact to write, so I chose to use that version.
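(For reference, the for-loop variant being compared would look something like the following sketch; the code actually tried may have differed slightly. Both versions use the precomputed `other_nodes` and `one_over_denominator`, so the only difference is loop syntax versus a `prod` generator expression.)

```julia
# Sketch of a for-loop equivalent of lagrange_poly_optimised().
# Same precomputed inputs, no per-node divisions; only the loop style differs.
function lagrange_poly_optimised_loop(other_nodes, one_over_denominator, x)
    poly = one_over_denominator
    for n in other_nodes
        poly *= (x - n)
    end
    return poly
end
```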
I still think we'll want to do N 1d interpolations rather than one N-dimensional interpolation (although that has its own memory cost of at least one buffer array...), so I'm only anticipating 1d versions of "other nodes". It's true that if we wanted an N-dimensional version, the memory usage for "other nodes" could become a concern!
> Some timing checks would be helpful here to justify the changes
I did the one test on a case I happened to be running anyway. I was using the profiler (StatProfilerHTML) to measure how much time was spent in different parts of the code, so it's not too convenient to post the results. The speedup (factor of a few) was in line with what I'd expect, so I wasn't planning to do any more timing for this PR.
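(For anyone wanting to reproduce this kind of measurement, StatProfilerHTML is typically used along these lines. This is a usage sketch, not runnable as-is: it needs the StatProfilerHTML package installed, and `run_case()` is a hypothetical stand-in for whatever entry point you are profiling.)

```julia
using StatProfilerHTML  # renders Julia's statistical profiler output as HTML

run_case()               # warm-up call, so compilation isn't profiled
@profilehtml run_case()  # profile the call and write an HTML report
```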
I did suggest a replacement for `interpolate_2D_vspace!()` that I think should be faster, but left it commented out because it would need testing for both correctness and performance, and it needs an additional pdf-sized buffer passed as an argument - I wasn't planning to include that change in this PR unless someone else has time to do the testing.
> I did suggest a replacement for `interpolate_2D_vspace!()` that I think should be faster, but left it commented out because it would need testing for both correctness and performance, and it needs an additional pdf-sized buffer passed as an argument - I wasn't planning to include that change in this PR unless someone else has time to do the testing.
There is already an existing CI test for the `interpolate_2D_vspace!()` function in the Fokker-Planck tests that you should be able to quickly modify if you want to include this here. You would be able to add a logical flag to the list of tests made at this line https://github.com/mabarnes/moment_kinetics/blob/73f8bf5a99dc1f07272f758208543caf3fac41f3/moment_kinetics/test/fokker_planck_tests.jl#L75, and then switch between my original interpolation and your optimised version at these lines https://github.com/mabarnes/moment_kinetics/blob/73f8bf5a99dc1f07272f758208543caf3fac41f3/moment_kinetics/test/fokker_planck_tests.jl#L121 and https://github.com/mabarnes/moment_kinetics/blob/73f8bf5a99dc1f07272f758208543caf3fac41f3/moment_kinetics/test/fokker_planck_tests.jl#L125.
Hope this helps.
Precalculate some quantities to minimise the amount of work needed to evaluate the Lagrange polynomials, and hopefully to make the loops vectorise better. Speeds up interpolation by about 3x in one case that I profiled.