Closed: maclariz closed this issue 1 year ago
Using larger chunks should reduce computation time at the expense of increased memory use, as you noticed. All parameters of all public classes, functions and methods are documented in the API reference, and a short note about this trade-off is also included in the orientation clustering tutorial. That said, I guess we could explain the trade-off better in the API reference.
The default of 20 orientations per chunk is a conservative number, as we don't want anyone to run into a memory issue when doing the computation lazily (say, on a computer with 4 GB of RAM that is busy with other tasks as well as this computation).
Thank you for raising this issue, @maclariz. We've expanded the relevant docstrings to explain these observations in more detail.
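As a back-of-envelope illustration of why the default is conservative: the peak memory of one chunk of a float64 distance matrix scales as chunk size × total orientations × 8 bytes. The numbers below are hypothetical, chosen only to show the scaling, and are not the library's defaults or limits.

```python
# Rough per-chunk memory for a float64 distance matrix:
# chunk_size rows x n_total columns x 8 bytes each.
# All numbers here are illustrative, not measured.
n_total = 50_000          # hypothetical total number of orientations
bytes_per_value = 8       # float64
for chunk_size in (20, 1000):
    mib = chunk_size * n_total * bytes_per_value / 2**20
    print(f"chunk_size={chunk_size}: ~{mib:.0f} MiB per chunk")
```

So going from 20 to 1000 orientations per chunk multiplies the per-chunk temporary by 50x, which is exactly the speed-for-memory trade mentioned above.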
This is just a wee note that I am finding that setting a much larger chunk size in get_distance_matrix is much, much faster; by that I mean using 1000 rather than 20. Maybe that doesn't work on every computer, and not everyone has that much RAM, but I suspect that advising people to increase it until it fails would be a good thing and save a lot of computation time. Also, the GitHub documentation mentions nothing about setting `lazy=True`, which is essential for larger datasets.
Ian
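The trade-off discussed in this thread can be sketched with a toy chunked pairwise computation. This is hypothetical illustration code, not the library's implementation; `distance_matrix_chunked` and its signature are invented for the sketch.

```python
import numpy as np

def distance_matrix_chunked(x, chunk_size=20):
    """Pairwise |x_i - x_j| matrix, computed chunk_size rows at a time.

    A toy stand-in for a chunked get_distance_matrix: a larger chunk_size
    means fewer Python-level loop iterations (faster), but a bigger
    temporary array per step (more memory).
    """
    n = len(x)
    out = np.empty((n, n))
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Temporary of shape (stop - start, n) -- grows with chunk_size
        out[start:stop] = np.abs(x[start:stop, None] - x[None, :])
    return out

# The result is identical for any chunk size; only the speed and the
# peak memory per step differ.
x = np.linspace(0.0, 1.0, 200)
small = distance_matrix_chunked(x, chunk_size=20)
large = distance_matrix_chunked(x, chunk_size=1000)
assert np.allclose(small, large)
```

This is why "increase the chunk size until it fails" works as a tuning strategy: correctness never depends on the chunk size, only resource use does.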