scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Why is KNearestNeighbors slower with n_jobs > 1? #6645

Open 29antonioac opened 8 years ago

29antonioac commented 8 years ago

Hi all! I've asked this on StackOverflow but didn't get any answer :(. I can't understand why KNearestNeighbors is slower with n_jobs > 1. I've set the start method for multiprocessing to forkserver, but it's nearly 100 times slower...

>>> import platform; print(platform.platform())
Linux-4.4.5-1-ARCH-x86_64-with-arch-Arch-Linux
>>> import sys; print("Python", sys.version)
Python 3.5.1 (default, Mar  3 2016, 09:29:07) 
[GCC 5.3.0]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.0
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.17.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.17.1
>>> 

I ran this sample code; the results are posted below.

#!/usr/bin/env python

import numpy as np
from sklearn import neighbors
from time import time
from sklearn import datasets
import multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method('forkserver', force=True)
    np.random.seed(123456)

    iris = datasets.load_iris()

    # I vary n_jobs between runs (1 vs 2) and compare the timings below.
    knn = neighbors.KNeighborsClassifier(n_neighbors=3, n_jobs=2)

    # Shuffle, train on the first 100 samples, score on the remaining 50.
    perm = np.random.permutation(iris.target.size)
    iris.data = iris.data[perm]
    iris.target = iris.target[perm]
    knn.fit(iris.data[:100], iris.target[:100])

    start = time()
    score = knn.score(iris.data[100:], iris.target[100:])
    end = time()

    print(score)
    print(end - start)

Output:

[antonio@Antonio-Arch Práctica 1: Búsquedas por Trayectorias]$ python cv.py 
0.94
0.001409292221069336 # Time with n_jobs = 1
[antonio@Antonio-Arch Práctica 1: Búsquedas por Trayectorias]$ python cv.py 
0.94
0.10267972946166992 # Time with n_jobs = 2

Could anyone explain this behaviour to me? I've tried with much more complex datasets and the behaviour is the same :(.

Thank you!

iblancasa commented 8 years ago

I have the same problem. I don't understand this situation.

pprett commented 8 years ago

Process or thread pool creation in Python is extremely costly:

In [7]: %timeit joblib.Parallel(n_jobs=5, backend='threading')([])
10 loops, best of 3: 103 ms per loop

I'd assume multi-processing is equally bad or even worse.

The same issue happens when you predict small batches with random forests and n_jobs > 1. If your batches have fewer than roughly 1,000-10,000 rows, I suggest not using n_jobs > 1. In sklearn you can usually turn off parallel computing at predict time by setting n_jobs=1 on the fitted estimator.
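
For reference, a minimal sketch of that suggestion applied to the snippet above (set_params is the generic way to change n_jobs on an already-fitted estimator; for KNN no refit is needed since n_jobs only matters at query time):

import numpy as np
from sklearn import datasets, neighbors

iris = datasets.load_iris()
rng = np.random.RandomState(123456)
perm = rng.permutation(iris.target.size)
X, y = iris.data[perm], iris.target[perm]

# Fit as before; for KNN, n_jobs has no effect on fitting itself.
knn = neighbors.KNeighborsClassifier(n_neighbors=3, n_jobs=2)
knn.fit(X[:100], y[:100])

# For a small prediction batch, switch off parallelism to skip the pool
# start-up cost (no refit needed -- n_jobs is only read at query time).
knn.set_params(n_jobs=1)
print(knn.score(X[100:], y[100:]))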

jnothman commented 8 years ago

What Peter said. Multiprocessing comes at a cost, and results with a tiny dataset like Iris will tell you nothing about working at scale. Note that parallelism is only used in Nearest Neighbors at predict time.
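
To see when parallelism starts to pay off, a rough benchmark sketch along these lines can help (dataset and batch sizes below are arbitrary, chosen only to be large enough for the pool start-up cost to amortise):

from time import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 100k training rows, 100k query rows.
X, y = make_classification(n_samples=200000, n_features=20, random_state=0)
X_train, y_train = X[:100000], y[:100000]
X_test, y_test = X[100000:], y[100000:]

for n_jobs in (1, 2):
    knn = KNeighborsClassifier(n_neighbors=3, n_jobs=n_jobs)
    knn.fit(X_train, y_train)
    start = time()
    knn.predict(X_test)
    print("n_jobs=%d: %.2fs" % (n_jobs, time() - start))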


pprett commented 8 years ago

I wonder what we should do: (a) put in a heuristic for when to actually use a parallel backend at prediction time, or (b) emit a warning when predicting small batches with n_jobs > 1?
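
A rough sketch of what (b) could look like; the helper name and the 1000-row threshold are hypothetical, not existing scikit-learn API:

import warnings

def _maybe_warn_small_batch(n_samples, n_jobs, threshold=1000):
    # Hypothetical helper: warn when parallel prediction is requested for a
    # batch so small that pool start-up will likely dominate the runtime.
    if n_jobs is not None and n_jobs != 1 and n_samples < threshold:
        warnings.warn(
            "n_jobs=%s requested for only %d samples; the parallel overhead "
            "may make prediction slower than n_jobs=1." % (n_jobs, n_samples),
            UserWarning,
        )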

iblancasa commented 8 years ago

Ok, thanks.

I think the best (and easiest) solution is (b), @pprett. If you want, I could try to implement it.

jnothman commented 8 years ago

I wonder what we should do: (a) put in a heuristic for when to actually use a parallel backend at prediction time, or (b) emit a warning when predicting small batches with n_jobs > 1?

Actually, I quite frequently find users having issues related to data size -- too small or too large -- that the core devs consider obvious: memory usage and time for fitting, model size on disk, the potential benefits of parallelism, statistical reliability over small samples, etc. The memory/time requirements, the benefits of parallelism, and the model size on disk could all be roughly estimated, and we could provide methods on estimators to do so. But that sounds like a big engineering task, and we have not even tabulated theoretical complexities, which seems an appropriate first step. A better option might be a documentation page about sample-size issues that we can refer people to...?
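
As a very rough illustration of the kind of estimate meant here (just a measurement script, not a proposed estimator API), fit time and model size on disk can already be eyeballed by timing and pickling fitted estimators on growing subsamples:

import pickle
from time import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=40000, n_features=10, random_state=0)

for n in (5000, 10000, 20000, 40000):
    knn = KNeighborsClassifier(n_neighbors=3)
    start = time()
    knn.fit(X[:n], y[:n])
    print("n=%d fit=%.3fs pickled=%d bytes"
          % (n, time() - start, len(pickle.dumps(knn))))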

amueller commented 7 years ago

I think we discussed adding computational and memory complexity to the documentation for all the models, which would be nice. Should we open an issue with a checklist of all estimators?

tomcwalker commented 7 years ago

Some observations I've noted on this.

On my Windows 7 + AMD system, n_jobs doesn't work for K Nearest Neighbours. I set it to 6 on my 8-core system, but I get 12% CPU usage, which is a single core. I see a few Python processes (3), but only one is using any CPU.

https://stackoverflow.com/questions/46252285/cant-get-sklearn-working-on-multiple-cores-using-n-jobs
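
One thing worth ruling out on Windows (it may or may not be the cause here): the spawn start method re-imports the script in each worker process, so anything that uses n_jobs > 1 should keep its work under the usual __main__ guard, roughly like this:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

def main():
    X, y = make_classification(n_samples=100000, n_features=10, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=3, n_jobs=6)
    knn.fit(X[:50000], y[:50000])
    print(knn.score(X[50000:], y[50000:]))

if __name__ == "__main__":
    # On Windows, worker processes re-import this module; the guard keeps
    # them from re-running the top-level code.
    main()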

Also, there seems to be a huge downturn in performance from KNeighborsRegressor.fit that's way beyond what I'd expect from the algorithmic complexity once I go from a few hundred thousand rows to a few million. We're talking going from a few minutes to many hours. The process isn't using as much memory as I'd expect given the large number of rows.

Some stats below for 5 dimensions and 1 neighbor.

Rows | Elapsed time (seconds)
---- | ----------------------
100000 | 11.5
200000 | 32
400000 | 179
600000 | 508
1200000 | 1535
3100000 | 7200+
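
A reproducible sketch of that kind of measurement on synthetic data (row counts here are illustrative); comparing the algorithm settings can show whether the blow-up comes from tree construction or something else:

from time import time

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)

for n_rows in (100000, 200000, 400000, 800000):
    X = rng.rand(n_rows, 5)
    y = rng.rand(n_rows)
    for algorithm in ("kd_tree", "ball_tree", "brute"):
        start = time()
        KNeighborsRegressor(n_neighbors=1, algorithm=algorithm).fit(X, y)
        print("%d rows, %s: %.2fs" % (n_rows, algorithm, time() - start))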

dtemkin commented 4 years ago

I know this is a fairly stagnant issue, but it's something I was thinking about while having trouble with n_jobs and OPTICS. Could it be that clustering algorithms, in particular, don't support a multiprocessing paradigm very well due to the nature of the algorithm? It seems to me that a "divide and conquer" approach, as used in multiprocessing, would severely slow down anything whose cases are not independent.
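
As far as I understand (worth verifying against the source), n_jobs in OPTICS is only passed down to the nearest-neighbour queries while the ordering loop runs sequentially, so a quick empirical check like this can show whether parallelism helps at all on a given dataset (sizes below are arbitrary):

from time import time

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

for n_jobs in (1, -1):
    start = time()
    OPTICS(min_samples=10, n_jobs=n_jobs).fit(X)
    print("n_jobs=%s: %.2fs" % (n_jobs, time() - start))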