sissa-data-science / DADApy

Distance-based Analysis of DAta-manifolds in python
https://dadapy.readthedocs.io/
Apache License 2.0

Parallel query on sklearn's and scipy's kdtrees #104

Closed lykos98 closed 5 months ago

lykos98 commented 7 months ago

Good morning! While exploring the code I found that in the function compute_cross_nn_distances() kdtrees from sklearn and scipy are used, but the number of threads for the neighborhood search is left at the default setting, which is 1.

In a local copy of the repository I tried setting n_jobs = -1 and workers = -1, respectively for the sklearn and scipy trees, in order to use all available threads and speed up the computation. Is there a deeper reason to keep the neighborhood query single-threaded? Thank you!
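For reference, a minimal standalone sketch of the change I tried locally (not DADApy's actual code, just the two calls in isolation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))

# sklearn: n_jobs=-1 asks for all available cores
nn = NearestNeighbors(n_neighbors=10, n_jobs=-1).fit(X)
distances, indices = nn.kneighbors(X)

# scipy: workers=-1 does the same (the parameter was called
# n_jobs in scipy releases older than 1.6)
tree = cKDTree(X)
distances, indices = tree.query(X, k=10, workers=-1)
```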

diegodoimo commented 7 months ago

Hi, I will just address the part related to the kneighbors function implemented in sklearn, which we call in compute_cross_nn_distances().

In our initial tests, we observed that setting n_jobs = 1 does not seem to limit the distance matrix computation to a single thread. In my experience, when I need to control the number of threads, I set the environment variable OMP_NUM_THREADS accordingly before running the script, or at the beginning of it.
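For example, a minimal sketch of what I mean:

```python
import os

# Must happen before importing numpy/sklearn, otherwise the OpenMP
# runtime may already be initialized with the default thread count.
# Equivalently, from the shell: OMP_NUM_THREADS=1 python script.py
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
from sklearn.neighbors import NearestNeighbors
```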

The behavior of kneighbors may have changed in the meantime. Can you confirm that the kneighbors call in compute_cross_nn_distances() is multithreaded even when n_jobs defaults to 1?

lykos98 commented 7 months ago

Actually, from my further tests, kneighbors uses one thread when tree-based methods are invoked and n_jobs is not set. When the method falls back to brute-forcing the distance matrix, more than one thread is used, specifically half of the logical cores (I am running with hyperthreading enabled, so this is reasonable to expect from such a library).

With n_jobs = -1 I get full CPU utilization on all cores when using the tree methods, while the brute-force behavior is the same as with n_jobs = 1. According to the sklearn documentation, setting OMP_NUM_THREADS should override the n_jobs parameter.

By the way, the performance improvement from the parallel search becomes significant only for datasets larger than ~1M points, where the knn search starts to account for an important share of the computational time.
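For reference, this is roughly how I timed it (a quick sketch; the actual speedup will depend on dataset size and hardware):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 5))

# Compare single-threaded vs. all-cores tree-based search
for n_jobs in (1, -1):
    nn = NearestNeighbors(n_neighbors=10, algorithm="kd_tree", n_jobs=n_jobs).fit(X)
    t0 = time.perf_counter()
    nn.kneighbors(X)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - t0:.2f} s")
```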

diegodoimo commented 7 months ago

Yes, we may expose that parameter to the user in any case. But as you said, n_jobs does not always equal the number of threads actually used, which may depend on the selected algorithm (kd_tree, brute force, etc.). For this reason, I find it easier to set OMP_NUM_THREADS, since it fixes the number of threads predictably.
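If we do expose it, I imagine something along these lines (a hypothetical, simplified signature; the real compute_cross_nn_distances() takes more arguments):

```python
from sklearn.neighbors import NearestNeighbors

# Hypothetical sketch only: the actual compute_cross_nn_distances() in
# DADApy has a richer signature; this just illustrates forwarding n_jobs.
def compute_cross_nn_distances(X_new, X, maxk, n_jobs=1):
    nn = NearestNeighbors(n_neighbors=maxk, n_jobs=n_jobs).fit(X)
    distances, indices = nn.kneighbors(X_new)
    return distances, indices
```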

lykos98 commented 7 months ago

I understand. From a user experience perspective I would prefer to be able to set the number of threads from Python, to be consistent with the APIs of sklearn and scipy, but that is just a design choice. Since this behavior does not depend on dadapy, it might be worth mentioning somewhere in the documentation.

I will investigate the effect of OMP_NUM_THREADS further and let you know. Thank you!