shunwang / numexpr

Automatically exported from code.google.com/p/numexpr
MIT License

The number of threads should be set only for Numexpr native pthreads, not for VML #39

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Numexpr seems to be more efficient when using its native pthreads parallel code 
than when using the VML implementation. Currently, however, both the native and 
the VML thread counts are set to the total number of detected cores, which leads 
to poor performance.

For example, with a 2-core machine:

{{{
>>> from numpy import pi
>>> import numpy as np
>>> import numexpr as ne

>>> i = np.arange(1e6)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 2 native threads, 2 VML threads
10 loops, best of 3: 16.6 ms per loop
>>> ne.set_vml_num_threads(1)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 2 native threads, 1 VML thread
100 loops, best of 3: 8.95 ms per loop
>>> ne.set_num_threads(1)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 1 native thread, 1 VML thread
100 loops, best of 3: 14.9 ms per loop
>>> ne.set_vml_num_threads(2)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")  # 1 native thread, 2 VML threads
100 loops, best of 3: 12.9 ms per loop
}}}

As can be seen, the best time (8.95 ms) is obtained with 2 native threads and a 
single VML thread, that is, when the number of threads is set only in the native 
pthreads implementation.
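
The same comparison can be scripted outside IPython. The following is only a minimal sketch: `time_thread_grid` and `numexpr_grid` are hypothetical helper names, not part of Numexpr, and the timing uses the standard `timeit` module rather than the `timeit` magic. On a build without VML support the VML thread count is presumably irrelevant, so the VML axis of the grid would not be meaningful there.

```python
import itertools
import timeit


def time_thread_grid(run, set_native, set_vml, counts=(1, 2), repeat=3, number=10):
    """Return {(native, vml): best seconds per call} for each thread combination.

    run, set_native and set_vml are caller-supplied callables (hypothetical
    helper signature, not a Numexpr API).
    """
    results = {}
    for native, vml in itertools.product(counts, repeat=2):
        set_native(native)
        set_vml(vml)
        # Best of `repeat` runs, each averaged over `number` calls.
        best = min(timeit.repeat(run, repeat=repeat, number=number)) / number
        results[(native, vml)] = best
    return results


def numexpr_grid(size=1_000_000):
    """Run the grid against Numexpr; needs numpy and numexpr installed."""
    import numpy as np
    import numexpr as ne

    i = np.arange(float(size))

    def run():
        # Pass the operands explicitly instead of relying on frame lookup.
        ne.evaluate("cos(2*pi*i/100.)", local_dict={"pi": np.pi, "i": i})

    return time_thread_grid(run, ne.set_num_threads, ne.set_vml_num_threads)
```

Calling `numexpr_grid()` on a 2-core machine should give one timing per combination, comparable to the four measurements above. The absolute numbers will depend on the hardware and on the Numexpr build.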

Original issue reported on code.google.com by fal...@gmail.com on 28 Nov 2010 at 9:40

GoogleCodeExporter commented 9 years ago
Fixed in r261. Performance is now optimal with the default setup:

{{{
>>> from numpy import pi
>>> import numexpr as ne
>>> import numpy as np
>>> i = np.arange(1e6)
>>> timeit ne.evaluate("cos(2*pi*i/100.)")
100 loops, best of 3: 8.89 ms per loop
}}}
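
For builds that predate r261, the same configuration can presumably be applied by hand. This is a sketch under that assumption, not the actual change made in r261; `apply_native_only_threading` is an illustrative name, and the function expects the imported `numexpr` module as its first argument.

```python
def apply_native_only_threading(ne_module, ncores):
    """Give all cores to the native thread pool and keep VML single-threaded.

    ne_module is expected to be the imported numexpr module; ncores is chosen
    by the caller. The function name is illustrative, not a Numexpr API.
    """
    ne_module.set_num_threads(ncores)
    # Only touch VML when the build reports that it was compiled against it.
    if getattr(ne_module, "use_vml", False):
        ne_module.set_vml_num_threads(1)
```

A typical call would be `apply_native_only_threading(ne, ne.detect_number_of_cores())` after `import numexpr as ne`.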

Original comment by fal...@gmail.com on 28 Nov 2010 at 9:50