ExtraTreesRegressor predict of 1 sample is slower in 0.18.1 #8186

Closed teopir closed 7 years ago

teopir commented 7 years ago

Description

After upgrading to 0.18.1, predict of ExtraTreesRegressor on a single sample is about 10 times slower.

Version 0.17

Linux-4.4.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.17')
fit: 1.26665902138
predict: 0.00265598297119

Version 0.18.1

Linux-4.4.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.1')
fit: 1.34347200394
predict: 0.0433509349823

Steps/Code to Reproduce

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)

from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor
from time import time

numpy.random.seed(6652)

X, Y = make_friedman3(10000)

forest = ExtraTreesRegressor(n_estimators=100)
start = time()
forest = forest.fit(X, Y)
print 'fit: {}'.format(time() - start)
start = time()
output = forest.predict(numpy.array([[1, 2., 5, 32.]]))
print 'predict: {}'.format(time() - start)

I installed/uninstalled through pip:

pip install scikit-learn==0.17 --no-cache-dir
pip uninstall scikit-learn
pip install scikit-learn==0.18.1 --no-cache-dir

Cython version 0.23.4

jnothman commented 7 years ago

I can't reproduce this issue, at least with Cython 0.25.1.

I assume you've tested this with something more robust than a single measurement?
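For reference, a more robust measurement could look like the following minimal sketch, assuming the same Friedman #3 setup as in the report; the repetition count is an arbitrary choice.

import timeit
import numpy
from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor

# Same setup as the reproduction script above.
X, Y = make_friedman3(10000)
forest = ExtraTreesRegressor(n_estimators=100).fit(X, Y)
x = numpy.array([[1, 2., 5, 32.]])

# Average many single-sample predict calls instead of timing just one.
n_repeats = 1000
total = timeit.timeit(lambda: forest.predict(x), number=n_repeats)
print("mean predict time over {} calls: {}".format(n_repeats, total / n_repeats))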

teopir commented 7 years ago

Yes, I have tested the code over 1000 single-sample predictions. Following your reply, I updated the packages through pip:

pip install -U cython joblib numpy scipy

and I tested the dev version of scikit-learn:

pip install git+https://github.com/scikit-learn/scikit-learn.git

The issue is still there.

Version 0.19.dev0

Linux-4.4.0-59-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.19.dev0')
('Joblib', '0.10.3')
Cython version 0.25.2
fit: 1.39228105545
mean time on a sequence of 1000 points: 0.045794825792312623
std time on a sequence of 1000 points: 0.0054157891052802037

Version 0.17

Linux-4.4.0-59-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.11.3')
('SciPy', '0.18.1')
('Scikit-Learn', '0.17')
('Joblib', '0.10.3')
Cython version 0.25.2
fit: 1.27874112129
mean time on a sequence of 1000 points: 0.00260087490082
std time on a sequence of 1000 points: 8.01657810683e-05

Here is the updated code:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import joblib; print("Joblib", joblib.__version__)
import subprocess
subprocess.call('cython --version', shell=True)

from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor
from time import time

numpy.random.seed(6652)

X, Y = make_friedman3(10000)

forest = ExtraTreesRegressor(n_estimators=100)
start = time()
forest = forest.fit(X, Y)
print 'fit: {}'.format(time() - start)
ntests = 1000
times = []
for i in range(ntests):
    start = time()
    x = numpy.random.randint(-1000, 1000, 4).reshape(1, -1)
    output = forest.predict(x)
    times.append(time() - start)
print "mean time on a sequence of {} points: {}".format(ntests, numpy.mean(times))
print "std time on a sequence of {} points: {}".format(ntests, numpy.std(times))

jnothman commented 7 years ago

Scratch that. I need to try this again.

jnothman commented 7 years ago

    Numpy  Sklearn Extra  RFR    DTR
Py2 1.11.2 0.17.1  7e-04  4e-04  4e-05
Py2 1.11.2 0.18.1  5e-03  6e-03  4e-05
Py2 1.11.3 0.17.1  5e-04  4e-04  3e-05
Py2 1.11.3 0.18.1  7e-03  6e-03  3e-05
Py3 1.11.2 0.17.1  4e-04  4e-04  3e-05
Py3 1.11.2 0.18.1  6e-04  7e-04  3e-05
Py3 1.11.3 0.17.1  4e-04  4e-04  3e-05
Py3 1.11.3 0.18.1  6e-04  6e-04  3e-05

The problem applies equally to ExtraTreesRegressor and RandomForestRegressor, but not to DecisionTreeRegressor, and only in Python 2 with scikit-learn >= 0.18. (Times above are seconds per predict call on a single sample.)
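
The exact script behind the table is not shown in the thread; a sketch of this kind of per-estimator, single-sample timing (the hyperparameters and repeat count here are assumptions) could be:

import numpy
from time import time
from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, Y = make_friedman3(10000)
x = numpy.array([[1, 2., 5, 32.]])

# Time repeated single-sample predicts for each estimator in the table.
for name, est in [("Extra", ExtraTreesRegressor(n_estimators=100)),
                  ("RFR", RandomForestRegressor(n_estimators=100)),
                  ("DTR", DecisionTreeRegressor())]:
    est.fit(X, Y)
    start = time()
    for _ in range(1000):
        est.predict(x)
    print("{}: {:.0e} s per single-sample predict".format(name, (time() - start) / 1000))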

jnothman commented 7 years ago

I should say, the slowness exists to a lesser extent in Python 3.

vincentpham1991 commented 7 years ago

I can try to help with this. Do you have a general idea of where I should look first in the code?

dalmia commented 7 years ago

At first glance, the cause of this delay isn't obvious. I tried reverting a few changes made in forest.py and base.py (0.18.1), but with no success. Any ideas? @jnothman

glemaitre commented 7 years ago

I just found that the issue might come from commit e0d501f3109e60f9fec12d4d4b4f6c74a0ff75ef, which upgrades the bundled joblib from 0.9.4 to 0.10.0. You can see the output below.

@ogrisel @lesteve Do you think that there is something in joblib which could cause such an issue?

checkout e0d501f3109e60f9fec12d4d4b4f6c74a0ff75ef

('Python', '2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.dev0')
('Joblib', '0.10.0')
Cython version 0.25.2
fit: 2.93740701675
mean time on a sequence of 1000 points: 0.049894297123
std time on a sequence of 1000 points: 0.00444753804245

checkout f9d26fe15b90a0e469a68353f3bad6de7f5735e7

('Python', '2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18.dev0')
('Joblib', '0.9.4')
Cython version 0.25.2
fit: 3.10745596886
mean time on a sequence of 1000 points: 0.00370619463921
std time on a sequence of 1000 points: 7.82414638283e-05

jnothman commented 7 years ago

Interesting @glemaitre. (That would explain why DTR is not affected: a single decision tree's predict does not go through joblib.) Perhaps joblib needs some kind of regression testing to make sure n_jobs=1 stays fast. The fact that this mostly affects Python 2 makes it a little trickier.
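
A rough sketch of the kind of n_jobs=1 overhead check suggested here (the trivial workload and repeat counts are arbitrary assumptions):

from time import time
from joblib import Parallel, delayed
# (import from sklearn.externals.joblib instead to exercise the copy bundled in scikit-learn)

def identity(i):
    return i

# Time many small Parallel calls; with n_jobs=1 the per-call overhead
# should stay close to that of a plain Python loop.
n_repeats = 1000
start = time()
for _ in range(n_repeats):
    Parallel(n_jobs=1)(delayed(identity)(i) for i in range(100))
print("mean Parallel(n_jobs=1) call time: {}".format((time() - start) / n_repeats))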

lesteve commented 7 years ago

Just curious, is it common to do plenty of one-sample predictions like this? If there is an additional overhead in joblib 0.10 for n_jobs=1, I assume that predicting many samples at once would wash out the overhead.

glemaitre commented 7 years ago

Just curious, is it common to do plenty of one-sample predictions like this?

I would say no.

If there is an additional overhead in joblib 0.10 for n_jobs=1, I assume that predicting many samples at once would wash out the overhead.

Yep ...

import numpy
from time import time
from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor

X, Y = make_friedman3(10000)

forest = ExtraTreesRegressor(n_estimators=100)
start = time()
forest = forest.fit(X, Y)
ntests = 500000
times = []
for i in range(10):
    start = time()
    x = numpy.random.randint(-1000, 1000, (ntests, 4))
    output = forest.predict(x)
    times.append(time() - start)
print "mean time on a sequence of {} points: {}".format(ntests, numpy.mean(times))
print "std time on a sequence of {} points: {}".format(ntests, numpy.std(times))

joblib 0.10

mean time on a sequence of 500000 points: 2.93959348202
std time on a sequence of 500000 points: 0.0516504818222

joblib 0.9.6

mean time on a sequence of 500000 points: 2.84135320187
std time on a sequence of 500000 points: 0.0429042891631

teopir commented 7 years ago

In general, I think it is not common to do single-point predictions. However, it is common with dynamical systems, where your input depends on the evolution of the system.

Consider a control problem where the regressor (here ExtraTrees) represents the controller. The controller receives a state x and predicts the action u to execute. If you do not know the dynamical system, you can only evaluate a single state (point) when it is given to you.
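
As an illustration of that usage pattern, here is a hypothetical closed-loop sketch (the training data, dynamics and horizon are made up): the controller can only be queried one state at a time, because the next state depends on the action just chosen.

import numpy
from sklearn.ensemble import ExtraTreesRegressor

# Toy controller trained on placeholder state/action data.
states = numpy.random.rand(1000, 4)
actions = numpy.random.rand(1000)
controller = ExtraTreesRegressor(n_estimators=100).fit(states, actions)

def step(state, action):
    # Placeholder dynamics: in practice the real system is external/unknown.
    return state + 0.01 * action

state = numpy.zeros(4)
for t in range(100):
    # One single-sample predict per control step -- exactly the pattern
    # affected by the regression discussed in this issue.
    action = controller.predict(state.reshape(1, -1))[0]
    state = step(state, action)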

jnothman commented 7 years ago

Repeated single predictions are common in production systems, which scikit-learn estimators are not generally optimised for. Still, this is an unexpected regression, and library code with n_jobs=1 should have minimal overhead in joblib regardless of the application.
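
One possible user-side workaround while the regression is unfixed, sketched here under the assumption that the trees' check_input parameter and float32 input requirement stay as they are (they are implementation details and may change): average the per-tree predictions directly, skipping the joblib dispatch and per-tree input validation.

import numpy
from sklearn.datasets import make_friedman3
from sklearn.ensemble import ExtraTreesRegressor

X, Y = make_friedman3(10000)
forest = ExtraTreesRegressor(n_estimators=100).fit(X, Y)

# The trees expect a C-contiguous float32 array when validation is skipped.
x = numpy.ascontiguousarray([[1, 2., 5, 32.]], dtype=numpy.float32)
# For a forest regressor, predict is the mean of the individual trees' predictions.
y = numpy.mean([tree.predict(x, check_input=False)
                for tree in forest.estimators_], axis=0)
print(y)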

lesteve commented 7 years ago

Still this is an unexpected regression, and library code with n_jobs=1 should have minimal overhead in joblib regardless of application

Agreed, I was asking out of curiosity. I think I have spotted where the regression comes from (timeout support, if you are wondering). I'll open an issue in joblib.

jnothman commented 7 years ago

Good work!

jnothman commented 7 years ago

Fixed in joblib. Will be rectified in the next scikit-learn release, assuming we upgrade joblib.

jnothman commented 7 years ago

Thanks again to @teopir for reporting, and to @lesteve for investigating and fixing.

lesteve commented 7 years ago

For the record, we are planning to do a joblib 0.11 release soonish. We will probably get it into scikit-learn master not too long afterwards.