pabloferz / NIntegration.jl

Multidimensional numerical integration in pure Julia
MIT License

superseded by HCubature.jl? #8

Open stevengj opened 7 years ago

stevengj commented 7 years ago

https://github.com/stevengj/HCubature.jl implements the Genz-Malik algorithm in pure Julia for arbitrary dimensionalities.
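For reference, a minimal call looks like this (the integrand and bounds here are just an illustrative example, not from this thread; both packages are registered):

```julia
using HCubature, StaticArrays

# Integrate f(x, y) = x*y over the unit square; the exact value is 1/4.
# hcubature returns the estimated integral and an error estimate.
I, E = hcubature(x -> x[1] * x[2], SVector(0.0, 0.0), SVector(1.0, 1.0))
```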

pabloferz commented 7 years ago

I just saw that. The implementation here has an optimized three-dimensional rule (I have also planned on adding an optimized two-dimensional one). Would you be interested in migrating it, or in fusing both implementations over there?

stevengj commented 7 years ago

What do you gain from the optimization? HCubature precomputes the rule, so the actual application of the rule should be pretty fast.

giordano commented 7 years ago

@stevengj Since you're talking about optimizations, on METADATA repository you claimed that HCubature's performance is a bit better than Cubature, how did you measure that?

Using a very simple integrand in 1D (f(x) = x, on [0,1]), HCubature is two orders of magnitude slower than Cubature:

julia> using Cubature, HCubature, BenchmarkTools, StaticArrays

julia> @benchmark HCubature.hcubature(t -> t[1], SVector(0.0), SVector(1.0))
BenchmarkTools.Trial: 
  memory estimate:  5.44 KiB
  allocs estimate:  60
  --------------
  minimum time:     119.574 μs (0.00% GC)
  median time:      136.770 μs (0.00% GC)
  mean time:        143.156 μs (4.10% GC)
  maximum time:     44.287 ms (99.62% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark Cubature.hcubature(t -> t[1], [0.0], [1.0])
BenchmarkTools.Trial: 
  memory estimate:  2.52 KiB
  allocs estimate:  61
  --------------
  minimum time:     2.970 μs (0.00% GC)
  median time:      3.071 μs (0.00% GC)
  mean time:        4.061 μs (20.14% GC)
  maximum time:     4.780 ms (99.84% GC)
  --------------
  samples:          10000
  evals/sample:     9

Or maybe I'm calling HCubature.hcubature in an inefficient way?

stevengj commented 7 years ago

The 1d case is not as well optimized (it calls QuadGK), and usually if you are doing 1d integration it is with a less trivial integrand. If you go to 2d, with a nontrivial integrand so that you aren't just benchmarking the overhead of initializing the quadrature, then it is slightly faster than Cubature:

julia> @benchmark HCubature.hcubature($(t -> sin(10t[1]*t[2])), $(SVector(0.,0.)), $(SVector(1.,1.)), rtol=1e-8)
BenchmarkTools.Trial: 
  memory estimate:  941.08 KiB
  allocs estimate:  34938
  --------------
  minimum time:     1.372 ms (0.00% GC)
  median time:      1.472 ms (0.00% GC)
  mean time:        1.596 ms (4.92% GC)
  maximum time:     4.261 ms (53.66% GC)
  --------------
  samples:          3128
  evals/sample:     1

julia> @benchmark Cubature.hcubature($(t -> sin(10t[1]*t[2])), $(SVector(0.,0.)), $(SVector(1.,1.)), reltol=1e-8)
BenchmarkTools.Trial: 
  memory estimate:  1.93 MiB
  allocs estimate:  42118
  --------------
  minimum time:     1.846 ms (0.00% GC)
  median time:      1.981 ms (0.00% GC)
  mean time:        2.155 ms (4.96% GC)
  maximum time:     5.086 ms (34.67% GC)
  --------------
  samples:          2317
  evals/sample:     1

(It will get even better once Julia can stack-allocate temporary structs; maintaining the heap of integration subregions is currently a bigger overhead than I would like.)

stevengj commented 7 years ago

(I just spotted the problem with the 1d case: it is re-computing the Kronrod rule, which involves solving an eigenproblem, every time. I thought the QuadGK.kronrod function cached the rule, but actually that is done elsewhere in QuadGK. I will update it to cache the rule and then I will try it again.)
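The fix being described can be sketched roughly like this (a hypothetical illustration: a Gauss-Legendre rule built via the Golub-Welsch eigenproblem, cached per `(type, order)`; QuadGK's actual Kronrod computation and cache layout differ):

```julia
using LinearAlgebra

# Computing a quadrature rule involves solving an eigenproblem, so it is
# worth doing only once per (element type, number of points) and caching.
const RULE_CACHE = Dict{Tuple{DataType,Int},Any}()

function gauss_rule(::Type{T}, n::Integer) where {T<:AbstractFloat}
    get!(RULE_CACHE, (T, n)) do
        # Jacobi-matrix off-diagonals for the Legendre recurrence
        b = T[k / sqrt(4k^2 - one(T)) for k in 1:n-1]
        E = eigen(SymTridiagonal(zeros(T, n), b))
        x = E.values                     # nodes on [-1, 1]
        w = 2 .* abs2.(E.vectors[1, :])  # weights (Golub-Welsch)
        (x, w)
    end
end
```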

stevengj commented 7 years ago

Okay, the 1d case is now fixed too (or will be once I push a commit); thanks for spotting this!

julia> @btime HCubature.hcubature($(t -> t[1]), $(SVector(0.)), $(SVector(1.)), rtol=1e-8)
  2.811 μs (36 allocations: 1008 bytes)
(0.5, 0.0)

julia> @btime Cubature.hcubature($(t -> t[1]), $(SVector(0.)), $(SVector(1.)), reltol=1e-8)
  13.540 μs (82 allocations: 3.36 KiB)
(0.5, 5.551115123125783e-15)

stevengj commented 7 years ago

Further speedups should be possible in the multidimensional case by caching the cubature rules there too, but I'm not sure it's worth it; normally I would expect multidimensional cubature to require enough function evaluations that the start-up cost of computing the rule doesn't matter?

stevengj commented 7 years ago

For 2d integrals with a trivial integrand, I got a 40% speedup by caching the Genz–Malik rule. For more complex integrands that require adaptive refinement, the difference is negligible. But I guess integrating low-degree polynomials is pretty common (e.g. in FEM packages), and it only takes a few lines of code to cache the rule (for a given dimensionality and floating-point type), so I pushed that commit too.

stevengj commented 7 years ago

(Whoops, I just noticed that I acknowledged @pabloferz rather than @giordano in the commit message for https://github.com/stevengj/HCubature.jl/commit/5e1b559ca17f2d63d940cb803888957efe26b4fa, sorry.)

pabloferz commented 7 years ago

The rule used here for 3 dimensions is not the same as the original Genz-Malik rule; it's an optimization of that rule in that it uses fewer function evaluations per region and is an 11th-degree rule instead of a 9th-degree one.

In the paper "An Adaptive Algorithm for the Approximate Calculation of Multiple Integrals" by Berntsen, Espelid, and Genz, they mention that there is a 13th-degree rule that can be used for 2-dimensional integrals, and the 11th-degree rule used here for 3d. For 4-dimensional or higher integrals, I believe the only rule reported to date is the original Genz-Malik rule.

pabloferz commented 7 years ago

Additionally, I slightly modified the Berntsen-Espelid 11th-degree rule, originally reported in [REF], so that the computation of the error in the 3d case is a bit faster.

[REF] Jarle Berntsen and Terje O. Espelid: On the construction of higher degree three dimensional embedded integration rules. Reports in Informatics, University of Bergen, May 1985.
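The embedded-rule idea behind these degree pairs (a lower-degree estimate nested inside a higher-degree one, sharing function evaluations) can be illustrated in one dimension with a deliberately simple pair: Simpson's rule with an embedded trapezoid rule. This is only a sketch of the error-estimation principle, not the Berntsen-Espelid construction:

```julia
# Evaluate the integrand once at shared nodes, combine with two weight
# sets of different degree, and use their difference as an error estimate.
# Here Simpson (degree 3) embeds the trapezoid rule (degree 1).
function embedded_simpson(f, a, b)
    fa, fm, fb = f(a), f((a + b) / 2), f(b)
    h = b - a
    I_hi = h / 6 * (fa + 4fm + fb)   # higher-degree rule
    I_lo = h / 2 * (fa + fb)         # embedded lower-degree rule
    return I_hi, abs(I_hi - I_lo)    # estimate and error indicator
end
```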

pabloferz commented 7 years ago

Another thing I noticed is that HCubature also uses the original Genz-Malik algorithm. Here I used the algorithm from "An Adaptive Algorithm for the Approximate Calculation of Multiple Integrals" (it's actually a one-to-one implementation of the algorithm described there). The main differences are that it's supposed to give better error estimates and has a better region-subdivision scheme.

stevengj commented 7 years ago

If you have a higher-degree rule that uses fewer function evaluations, that sounds great; a PR would be welcome (just put it into a separate file and define a new cubrule method that returns it for 3d). Do you have a reference for it?

Regarding the 1991 Berntsen and Espelid paper you mention, from my understanding there are three main modifications compared to Genz and Malik. First, they handled vector-valued integrands, but HCubature handles this too (with a user-specifiable norm). Second, they have some advice about parallelization, but if I implement this I like the algorithm by Gladwell (1987) described here better. Third, they describe a "two-level" error-estimation scheme that supposedly makes it more robust, but I remember looking at this at one point and being a bit skeptical; I wrote in my notes that it substantially increased the number of function evaluations for some of my test integrands. But maybe I should revisit it, at least as an option.

pabloferz commented 7 years ago

Now I remember: the rule used here actually uses more function evaluations per region (127) than the Genz-Malik one (77) for 3D integrals, but being of higher degree it should need fewer subdivisions, so in principle it could be faster for highly oscillatory functions. Regardless, I remember comparing the performance of the implementation here against Cubature (maybe that won't be the case with HCubature) and found this one (with the "two-level" error-estimation scheme) to be at least 2x faster even for low-degree polynomials.

Regarding parallelization I also thought that their strategy wasn't the best one.

stevengj commented 7 years ago

For low-degree polynomials the error-estimation scheme is irrelevant. For a low-degree polynomial (where one evaluation of the rule suffices) on my machine, HCubature is 5x faster than Cubature:

julia> @btime HCubature.hcubature($(t -> t[1]), $(SVector(0.,0.,0.)), $(SVector(1.,1.,1.)), rtol=1e-8)
  2.979 μs (37 allocations: 1.45 KiB)
(0.49999999999999994, 5.551115123125783e-17)

julia> @btime Cubature.hcubature($(t -> t[1]), $(SVector(0.,0.,0.)), $(SVector(1.,1.,1.)), reltol=1e-8)
  15.600 μs (132 allocations: 5.70 KiB)
(0.49999999999999994, 5.551115123125783e-17)

stevengj commented 7 years ago

I just optimized the case where no subdivisions are required a bit further and it is now down to 1.914 μs, or 8x faster than Cubature.hcubature for a trivial 3d integrand. (Haven't pushed that commit yet.)

stevengj commented 7 years ago

(Note that in some sense the comparison to Cubature is unfair. If you called the C library directly it would be faster — there is a price paid for marshalling/unmarshalling the arguments for the ccall so that the Julia callback doesn't need to deal with raw pointers.)
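The marshalling cost being referred to is the wrapper layer between the Julia callback and the C library. A minimal illustration of that pattern, using libc's qsort rather than Cubature's actual C API (so the signature here is just an example):

```julia
# The Julia callback must be exposed as a C-compatible function pointer,
# with raw pointers converted back to Julia values on every invocation --
# this conversion is the per-call overhead being discussed.
cmp(a::Ref{Cdouble}, b::Ref{Cdouble})::Cint =
    a[] < b[] ? -1 : a[] > b[] ? 1 : 0

function c_sort!(v::Vector{Cdouble})
    cb = @cfunction(cmp, Cint, (Ref{Cdouble}, Ref{Cdouble}))
    ccall(:qsort, Cvoid, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}),
          v, length(v), sizeof(Cdouble), cb)
    return v
end
```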

pabloferz commented 7 years ago

I agree that for low-degree polynomials the error-estimation scheme is irrelevant.

For your example above, I get the following on my machine on 0.6 (this package is even faster on 0.5, and more so on master):

julia> let f = x -> x[1]
           @btime HCubature.hcubature($f, $(SVector(0.0, 0.0, 0.0)), $(SVector(1.0, 1.0, 1.0)))
       end
  982.000 ns (16 allocations: 400 bytes)
(0.49999999999999994, 5.551115123125783e-17)

julia> let f = (x, y, z) -> x
           @btime NIntegration.nintegrate($f, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
       end
  605.363 ns (40 allocations: 912 bytes)
(0.5000000000000001, 7.412159432224628e-17, 127, 1 subregion)

stevengj commented 7 years ago

Pretty close! For such a simple integrand with no adaptivity (single subregion) we are mostly measuring overhead in looking up and applying the rule, iterating over arrays, and things like that, and it wouldn't surprise me to get a 30% speedup by unrolling everything the way you can in 3d.jl.

stevengj commented 7 years ago

In practice, I think it would be rare to have an integrand this cheap, so I'm skeptical that it's worthwhile to further optimize the rule-evaluation code.

(Even for an FEM code where people often integrate low-degree polynomials where the rule is exact, the integrand evaluation involves looking up things in mesh datastructures, coordinate transformations, etcetera.)

Performance-wise, the biggest annoyance for me at the moment is the number of allocations involved in managing/subdividing the region heap (in cases where lots of subdivisions are allowed), and I suspect you have similar issues? Hopefully this will go away once an immutable/struct like Region (what I called "box") can be stack-allocated? I see that you tried to improve things by adding mutable fields to your Region with a Box type (why not just use RefValue?).
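The heap being discussed can be pictured as follows (an illustrative minimal 1-D version with hypothetical field names; HCubature's actual Box type and heap code differ):

```julia
# An immutable Region ordered by its error estimate, kept in a Vector used
# as a binary max-heap so the worst region is always subdivided first.
struct Region
    a::Float64   # lower bound (1-D for brevity)
    b::Float64   # upper bound
    I::Float64   # integral estimate
    E::Float64   # error estimate
end

Base.isless(r::Region, s::Region) = r.E < s.E

function heappush!(h::Vector{Region}, r::Region)
    push!(h, r); i = length(h)
    while i > 1 && h[i >> 1] < h[i]          # sift up toward larger error
        h[i], h[i >> 1] = h[i >> 1], h[i]
        i >>= 1
    end
    return h
end

function heappop!(h::Vector{Region})
    top = h[1]
    h[1] = h[end]; pop!(h)
    i = 1
    while (c = 2i) <= length(h)              # sift down
        c < length(h) && h[c] < h[c + 1] && (c += 1)
        h[i] < h[c] || break
        h[i], h[c] = h[c], h[i]
        i = c
    end
    return top
end
```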

pabloferz commented 7 years ago

I tried to cut down allocations as much as possible, but this is as far as I could get without stack-allocated structs. I tried using a RefValue instead of a mutable struct, but I think it was allocating more, so I left it as it is (this may have changed on master).