Faster implementation for get_points_in_sphere() using ogrid

uellue commented 1 year ago

When profiling the performance of diffraction pattern simulation, most time was spent in this function.

This implementation runs more than three times faster since it reduces the calculation effort through broadcasting and re-uses coordinate and distance calculations.

Instead of using diffpy, it performs the equivalent transformation to cartesian and calculation of the norm directly to allow these optimizations.

Description of the change

Equivalent calculation using numpy.ogrid() and broadcasting to calculate hkl map, cartesian vectors, norm and selection to radius.

Progress of the PR

[x] Docstrings for all functions
[x] Unit tests with pytest for all lines
- Change should not change coverage
[ ] Clean code style by running black

Code style is dead link:

Minimal example of the bug fix or new feature

For reviewers

[ ] The PR title is short, concise, and will make sense 1 year later.
[ ] New functions are imported in corresponding __init__.py.
[ ] New features, API changes, and deprecations are mentioned in the unreleased section in CHANGELOG.rst.
[ ] Contributor(s) are listed correctly in credits in diffsims/release_info.py and in .zenodo.json.

uellue commented 1 year ago

The previous tests were run with parameters that turned out unusual, i.e. way too many reflections. Here's a more typical result, twice as fast for the whole simulation:

hakonanes commented 1 year ago

Thank you for this improvement, @uellue! The changes look good and the tests pass, so I'm happy to merge this PR.

Is it important for you to have this released soon? We don't have a release planned in the near future, as far as I'm aware, as the only current unreleased changes are maintenance updates. I plan to add some functionality to downstream kikuchipy in the next months, though. If I find anything to add or update to diffsims, I'll probably make a diffsims v0.6.0 release in connection with that.

A thought regarding speed of computations. While it's convenient to use class methods from diffpy.structure, they are not optimized for larger arrays, as you show here. If we replace more of these calls with our own computations, we could consider dropping diffpy.structure as a dependency.

uellue commented 1 year ago

Releasing can wait until it is convenient, no worries!

About replacing diffpy, their implementations didn't look so bad. This method was just a particular case because it still is THE hotspot of the entire simulation, intermediate values can be used for the return values instead of being recalculated, and the structure of the problem was perfect for broadcasting. The implementation here is pretty much equivalent to inlining the diffpy implementation. I also tried a Numba, but that didn't help. I guess that appending to a list of selected reflections depending on the norm is hard to vectorize for the compiler, so creating and then filtering the large intermediate arrays works well, too, in particular if they fit into the CPU caches. Broadcasting allows to only perform a minimum of operations on the full-size arrays.

If one wanted to speed this up further, I'd probably look into the structure of the whole simulation. I was wondering, shouldn't be the set of points in the sphere be the same for each rotation of the lattice? It feels so symmetric... In that case one could calculate a base version, select once, and then generate the rotated versions by rotating the selected part of the base? This sounds like a perfect job for GPUs, by the way.

uellue commented 1 year ago

This sounds like a perfect job for GPUs, by the way.

...resp. for GEMM on whatever platform -- should perform very well since it is a straight dot product with a rotation matrix.

CSSFrancis commented 1 year ago

If one wanted to speed this up further, I'd probably look into the structure of the whole simulation. I was wondering, shouldn't be the set of points in the sphere be the same for each rotation of the lattice? It feels so symmetric... In that case one could calculate a base version, select once, and then generate the rotated versions by rotating the selected part of the base? This sounds like a perfect job for GPUs, by the way.

Is there even a good reason that this is slow? I've been looking at this and I think @uellue is correct.

If you actually look at the code:

First the orientations are iterated through https://github.com/pyxem/diffsims/blob/fae5ca573b6ecb521d0f11fdce0f6a1e1695f6b6/diffsims/generators/library_generator.py#L119-L128

Then the function get_points_in_sphere is called a bunch of times https://github.com/pyxem/diffsims/blob/fae5ca573b6ecb521d0f11fdce0f6a1e1695f6b6/diffsims/generators/diffraction_generator.py#L243-L259

and then it is rotated....

So we can just call get_points_in_sphere once? ....

hakonanes commented 1 year ago

@CSSFrancis, regardless, should we merge this and move on?

CSSFrancis commented 1 year ago

@hakonanes The get_points_in_sphere method is going to be removed and the bigger problem is related to the fact that in this method there are repeated calls to get_points_in_sphere when there don't need to be.

It might be worth looking at the ReciporicalLatticeVector.from_min_dspacing method to see if there are changes there that could be sped up.

pyxem / diffsims