shunwang / numexpr

Automatically exported from code.google.com/p/numexpr
MIT License

calculation of sum is slow and uses only one core #73

Status: Open. Opened by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
1. a = numpy.random.random((10000, 10000))
2. numexpr.evaluate("sin(a) + exp(a) + log(a + 3)").sum() - fast, uses all cores
3. numexpr.evaluate("sum(sin(a) + exp(a) + log(a + 3))") - slow, uses one core

I often use sum for expressions like sum(exp(a[:,None]*b[None,:])), where two 
vectors are passed to numexpr and a single number is the output. It would be 
great to avoid creating the intermediate array a[:,None] * b[None,:] entirely.
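That kind of reduction can be done blockwise so the full outer-product temporary is never materialized. A minimal NumPy-only sketch (the function name and blocksize are illustrative; a numexpr.evaluate call could be applied to each block in the same way):

```python
import numpy as np

def chunked_outer_sum(a, b, blocksize=1024):
    # Computes sum(exp(a[:, None] * b[None, :])) without ever building the
    # full (len(a), len(b)) temporary: process `blocksize` rows at a time,
    # so peak memory is roughly blocksize * len(b) doubles.
    total = 0.0
    for start in range(0, len(a), blocksize):
        block = a[start:start + blocksize, None] * b[None, :]
        total += np.exp(block).sum()
    return total
```

This trades a single huge temporary for many small ones; the result is identical up to floating-point summation order.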

numexpr: 2.0.1, numpy:1.6.1, OS: ubuntu 11.04

Original issue reported on code.google.com by mimak...@gmail.com on 19 Feb 2012 at 2:06

GoogleCodeExporter commented 9 years ago
I can reproduce this issue. It looks like sum (even on a single array) does not 
use multiple threads, and nothing inside the sum is accelerated either.

What would be the best way to code this? I'll be happy to help.

Thanks.

Original comment by nicolas....@gmail.com on 4 Mar 2012 at 7:16

GoogleCodeExporter commented 9 years ago
+1 In a use case I am encountering, the numexpr.evaluate("sum(a)") version 
takes over 60 s to complete and uses only one core, BUT keeps memory usage 
quite low. OTOH, the numexpr.evaluate("a").sum() version takes just a few 
seconds and uses many cores, BUT uses as much as 15 GB of memory, albeit only 
momentarily.
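One middle ground between those two extremes is to chunk the reduction by hand: evaluate the expression on fixed-size blocks and accumulate the partial sums, so peak memory stays near the block size while each per-block call can still use all cores. A hedged sketch, with np.sin standing in for the numexpr.evaluate call (the function name and blocksize are illustrative):

```python
import numpy as np

def blockwise_sum(a, func=np.sin, blocksize=1_000_000):
    # Apply `func` (a stand-in for a multithreaded numexpr.evaluate call)
    # to consecutive blocks of the flattened array and accumulate the
    # partial sums, keeping the temporary at ~blocksize elements.
    flat = np.asarray(a).ravel()
    total = 0.0
    for start in range(0, flat.size, blocksize):
        total += func(flat[start:start + blocksize]).sum()
    return total
```

Summation order differs from a single .sum(), so the result can differ in the last few floating-point digits.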

This is a substantial defect that appears to affect multiple users.

Original comment by jsw....@gmail.com on 31 Jul 2012 at 3:59

GoogleCodeExporter commented 9 years ago
I don't think the default ndarray.sum() method is capable of using more than 
one core.
The dirty workaround I use for myself now is a parallel sum using OpenMP and 
weave.inline.
Here's the function...

import numpy

def openmpSum(in_array):
    """
    Fast parallel sum of an array using OpenMP via scipy.weave.
    """
    from scipy import weave
    a = numpy.asarray(in_array)
    b = numpy.array([0.])
    N = int(a.size)
    code = r"""
    int i = 0;
    double sum = 0;
    omp_set_num_threads(4);
    #pragma omp parallel for default(shared) private(i) reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];
    b[0] = sum;
    """
    weave.inline(code, ['a', 'N', 'b'],
                 extra_compile_args=['-march=native -O3 -fopenmp'],
                 support_code=r"""
    #include <stdio.h>
    #include <omp.h>
    #include <math.h>""",
                 libraries=['gomp'])
    return b[0]

Original comment by mimak...@gmail.com on 31 Jul 2012 at 6:05
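For readers without a working scipy.weave setup (it is Python-2-only), a thread-based alternative may be simpler: NumPy releases the GIL inside its reduction inner loops, so plain threads summing disjoint chunks can run concurrently. A sketch under that assumption (the function name and nthreads are illustrative):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

def parallel_sum(a, nthreads=4):
    # Split the flattened array into disjoint chunks and sum each one in a
    # worker thread; NumPy drops the GIL during the reduction loop, so the
    # chunk sums can overlap on multiple cores. Finally sum the partials.
    flat = np.asarray(a).ravel()
    chunks = np.array_split(flat, nthreads)
    with ThreadPool(nthreads) as pool:
        partials = pool.map(np.sum, chunks)
    return float(np.sum(partials))
```

As with any chunked sum, the floating-point result may differ from a single ndarray.sum() in the last digits because the summation order changes.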