The number of threads should be set only for Numexpr native pthreads, not for VML

Numexpr seems to be more efficient when using its native pthreads parallel code 
than when using the VML implementation. Currently, however, both the native and 
the VML thread counts are set to the total number of detected cores, which 
degrades performance.

For example, on a 2-core machine:

{{{
>>> from numpy import pi
>>> import numpy as np
>>> import numexpr as ne

>>> i = np.arange(1e6)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 2 threads native, 2 threads VML
10 loops, best of 3: 16.6 ms per loop
>>> ne.set_vml_num_threads(1)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 2 threads native, 1 thread VML
100 loops, best of 3: 8.95 ms per loop
>>> ne.set_num_threads(1)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 1 thread native, 1 thread VML
100 loops, best of 3: 14.9 ms per loop
>>> ne.set_vml_num_threads(2)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 1 thread native, 2 threads VML
100 loops, best of 3: 12.9 ms per loop
}}}

As the timings show, the best performance (8.95 ms) is achieved when multiple 
threads are used only in the native pthreads implementation, with VML limited 
to a single thread.
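
Until the defaults change, the fastest configuration above can be applied by hand. The following is only a minimal sketch of such a workaround: `prefer_native_threads` is a hypothetical helper name, not part of Numexpr, and it assumes the module passed in exposes `set_num_threads`, `set_vml_num_threads` and `detect_number_of_cores`, as used in the session above.

```python
def prefer_native_threads(ne, n_threads=None):
    """Use multiple threads only in Numexpr's native pool; keep VML at one.

    `ne` is the imported numexpr module, passed in as an argument.
    `prefer_native_threads` is a hypothetical name for illustration only.
    """
    if n_threads is None:
        # Fall back to the number of cores that Numexpr itself detects.
        n_threads = ne.detect_number_of_cores()
    # Limit VML to a single thread so it does not compete with the native pool.
    ne.set_vml_num_threads(1)
    # Hand all the parallelism to the native pthreads implementation.
    ne.set_num_threads(n_threads)
    return n_threads

# Usage (assumes numexpr is installed):
#   import numexpr as ne
#   prefer_native_threads(ne)
```

On the 2-core machine above this corresponds to the "2 threads native, 1 thread VML" case. Whether it is also the best setting on other machines or for other expressions would need to be measured.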

Original issue reported on code.google.com by fal...@gmail.com on 28 Nov 2010 at 9:40
