pyxem / orix

Analysing crystal orientations and symmetry in Python
https://orix.readthedocs.io
GNU General Public License v3.0

GPU computing of get_distance_matrix? #371

Open maclariz opened 1 year ago

maclariz commented 1 year ago

I was wondering if get_distance_matrix could go faster by using a GPU, which seems to be possible with Dask functions. What do you think?

hakonanes commented 1 year ago

Orientation.get_distance_matrix() runs in parallel with both NumPy (lazy=False) and Dask (lazy=True) on my machine, does it not do so on yours?

With my use of the method, I usually fill the available memory before becoming impatient...
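For reference, a minimal usage sketch of the toggle described above (hedged: the random data, symmetry choice, and default values are illustrative assumptions, not taken from the thread):

```python
from orix.quaternion import Orientation
from orix.quaternion.symmetry import Oh

# A small random set of cubic orientations, just for illustration
ori = Orientation.random((1000,))
ori.symmetry = Oh

# lazy=False computes the full matrix in memory with NumPy;
# lazy=True chunks the computation with Dask
D = ori.get_distance_matrix(lazy=True)
```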

maclariz commented 1 year ago

I can set lazy=False in the method. I was not aware of any Dask setting from your notes on the function, but Dask is, of course, installed.


hakonanes commented 1 year ago

The method reference is always a good place to see what the method can do!

Please close the issue if you're happy. If not, is there anything we should fix or improve?

maclariz commented 1 year ago

I tried lazy=False. The thing failed instantly due to memory requirements! This trick probably only works for very small datasets. I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient. I will close for now, but this seems an area where improvement should be possible.

hakonanes commented 1 year ago

This trick probably only works for very small datasets

Yes, this is unfortunately true. One simple approach is to allow a reduced floating-point precision of 32-bit instead of the current 64-bit; I think seven decimal digits should be enough. This is not something the user can do by themselves; the code needs to change.
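As a rough illustration of the saving (a back-of-the-envelope sketch, not orix code), the full matrix for N orientations stores N² values, so halving the precision halves the footprint:

```python
import numpy as np

n = 50_000  # orientations in a modest orientation map
for dtype in (np.float64, np.float32):
    # One value per orientation pair, itemsize bytes each
    gib = n * n * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
# float64: 18.6 GiB
# float32: 9.3 GiB
```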

I will have a think about the mathematics one day when I have time and see if there is a strategy that could be used to make this more efficient

That would be great!

maclariz commented 1 year ago

@hakonanes One way you could speed this up is by adding a Boolean CUDA flag to the function.

If False, it works as it does now.

If True, all calls to array operations are switched from np to cp (with import cupy as cp).

So, for example, cp.tensordot.

Obviously this only helps if you have a GPU set up for processing, but it would help some users of higher-end systems.
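A minimal sketch of that kind of switch (hypothetical, not orix's implementation: the function name is made up, and crystal symmetry is ignored for brevity):

```python
import numpy as np

def pairwise_angles(quats, use_cuda=False):
    """Hypothetical NumPy/CuPy dispatch, not orix's actual code.

    `quats` is an (n, 4) array of unit quaternions; returns the (n, n)
    matrix of rotation angles between all pairs (symmetry ignored).
    """
    if use_cuda:
        import cupy as xp  # optional dependency; needs an NVIDIA GPU
        quats = xp.asarray(quats)  # move data to the GPU
    else:
        xp = np
    # The absolute dot product of two unit quaternions is the cosine
    # of half the rotation angle between them
    dots = xp.abs(xp.tensordot(quats, quats, axes=([1], [1])))
    angles = 2 * xp.arccos(xp.clip(dots, -1.0, 1.0))
    return xp.asnumpy(angles) if use_cuda else angles
```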

hakonanes commented 1 year ago

I agree that supporting some computations on GPU using CuPy would be beneficial. Perhaps a good approach is to start small, supporting it only in this method, and then develop a framework over time. I don't know.

CuPy needs an NVIDIA GPU, and I have an Intel graphics card, meaning I cannot test this and would not benefit from working on this. I would be happy to review a PR, though.

If we start to support GPU computations with CuPy, it should be an optional dependency (via an orix[gpu] pip extra).
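For illustration, such an extra might look like this in a setup.py (a hypothetical packaging sketch; orix's actual setup may differ):

```python
from setuptools import setup

setup(
    name="orix",
    # Hypothetical optional extra: installed via `pip install orix[gpu]`,
    # so CuPy is never pulled in for users without an NVIDIA GPU
    extras_require={"gpu": ["cupy"]},
)
```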

maclariz commented 1 year ago

I have an NVIDIA GPU and can test. I am running other GPU-enabled functions using CuPy and seeing big speedups.

Perhaps draw up a list of functions that need updating if we do this.

maclariz commented 1 year ago

So, the following functions are supported by cp and can be ported:

einsum, arccos, nan_to_num, zeros, round, outer

I presume this is then possible.

maclariz commented 1 year ago

@hakonanes On starting small: in my limited experience with orix, this really is the only memory- and processor-hungry operation there. Everything else is quick for me, even on a laptop. But this function takes hours at the best of times...

So, this is perhaps the one obvious point for me where parallel processing on CUDA is really worthwhile.

hakonanes commented 1 year ago

I have an NVIDIA GPU and can test

That’s good. Actually, I have to backtrack, sorry, but I cannot review such a PR by myself as I don’t have an NVIDIA GPU to test the function… Would need help from someone who does (@harripj?).

And yes, supporting GPU computation in this function only is a good place to start in my opinion.

I’m unfamiliar with GPU processing using CUDA (I only know pyopencl), but I assume this will not reduce memory use in any way.

maclariz commented 1 year ago

On memory use: with a larger map, as I found out the other day, the memory required can run to hundreds of GB. There is no way this works in a single chunk on most machines, so lazy processing with Dask will be necessary for most cases. You could therefore do it by chunking into moderately sized chunks (e.g. 8-10 GB chunks for our GPU, which can take up to 12 GB), and each chunk would go much faster because it uses CuPy.
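A hedged sketch of that chunked scheme, using dask.array with a CuPy kernel per chunk (the random data and the _block_angles helper are hypothetical stand-ins, it needs an NVIDIA GPU with CuPy installed, and symmetry is again ignored):

```python
import dask.array as da
import numpy as np

def _block_angles(q_block, q_all):
    # Hypothetical per-chunk kernel: move one chunk of quaternions to
    # the GPU, compare it against the full set, bring the result back
    import cupy as cp
    a = cp.asarray(q_block)
    b = cp.asarray(q_all)
    dots = cp.abs(cp.tensordot(a, b, axes=([1], [1])))
    return cp.asnumpy(2 * cp.arccos(cp.clip(dots, -1.0, 1.0)))

# Toy stand-in for real data: 40 000 random unit quaternions in float32
rng = np.random.default_rng(0)
quats = rng.normal(size=(40_000, 4)).astype(np.float32)
quats /= np.linalg.norm(quats, axis=1, keepdims=True)

# Each 2000-row chunk holds 2000 x 40000 float32 values, about 0.3 GiB,
# well within a 12 GB GPU; tune the chunk size to the available memory
q_lazy = da.from_array(quats, chunks=(2_000, 4))
D = q_lazy.map_blocks(
    _block_angles, quats, dtype=np.float32, chunks=(2_000, 40_000)
)
row_block = D.blocks[0].compute()  # materialise one chunk at a time
```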

It might be useful to have a wee prep function just to find out how much memory is needed for a given chunk size, to let the user pick a reasonable chunk size for the later computation that fits the memory they actually have available.
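Such a helper could be little more than a multiplication; a hypothetical sketch:

```python
def chunk_memory_gib(n_orientations, chunk_rows, itemsize=8):
    """Hypothetical helper: memory held by one chunk of the matrix.

    Each chunk stores chunk_rows x n_orientations values of itemsize
    bytes (8 for float64, 4 for float32).
    """
    return chunk_rows * n_orientations * itemsize / 2**30

# 5000-row chunks against 200 000 orientations in float64:
print(chunk_memory_gib(200_000, 5_000))  # ~7.5 GiB, near the 8-10 GB above
```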

If you want testing, we can certainly help with our server. If needed, I could ask about a guest login from outside.

maclariz commented 1 year ago

I also had another idea. Working out literally every misorientation pair in the whole image is really overdoing the problem and probably totally unnecessary. Perhaps sampling every nth orientation in the dataset for comparison to the full set of pixels would reduce memory requirements and computing time by a factor of n. So you could safely do every 2nd pixel, and probably every 3rd or 4th, and get exactly the same results.
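A hedged sketch of that idea (with the caveat that get_distance_matrix() compares a set with itself, so the simplest form subsamples both axes, shrinking the matrix from (N, N) to (N/n, N/n), a factor of n² rather than n; comparing the subset against the full set would need a different method):

```python
# Assumes `ori` is an orix Orientation as in the earlier sketch, and
# that orix objects support NumPy-style slicing
n = 4
ori_sub = ori[::n]  # keep every nth orientation
D_sub = ori_sub.get_distance_matrix(lazy=True)
```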

maclariz commented 1 year ago

Basically, it is the same thing that makes SVD work: the problem is often rather oversampled and the same features turn up repeatedly in the dataset, so subsampling will still find the same features as analysing every data point.