Summing unusably slow - Githubissues

nmearl commented 10 years ago

I just want to make sure this is a limitation of the package, and not my own fault. But it seems that summing large arrays is prohibitively slow. I have several arrays with about 20,000 points, each with errors, but it's impossible to find the mean because it takes too long to calculate. To be more specific, I'm getting the mean to normalize the array

alpha = 1.0 + (alpha - alpha.mean()) / (alpha.max() - alpha.min())

Is there any quicker way this can be done? Thanks.

lebigot commented 10 years ago

This takes way too long on my machine too, with NumPy 1.8.1.

I remember summing array elements going faster. It was not very fast, but it was much faster, if my memory serves well. I'm going to have a closer look.

On the long term, the real solution is to have uncertainties internally handle arrays of numbers with uncertainties through an automatic differentiation package which is fast for this case. There are a few out there. This is currently quite high on my list of priorities for my next long vacation. :)

lebigot commented 10 years ago

I did some timing tests, which I reproduce here for reference:

>>> import numpy as np
>>> from uncertainties import unumpy as unp

The calculation time is exponential. I did not expect this at all:

>>> arr = unp.uarray(np.arange(20), np.arange(20)/10)                           
>>> %timeit arr.sum()
1000 loops, best of 3: 779 µs per loop
>>> arr = unp.uarray(np.arange(200), np.arange(200)/10)
>>> %timeit arr.sum()                                                           
10 loops, best of 3: 33.9 ms per loop
>>> arr = unp.uarray(np.arange(2000), np.arange(2000)/10)                       
>>> %timeit arr.sum()
1 loops, best of 3: 2.9 s per loop
>>> arr = unp.uarray(np.arange(20000), np.arange(20000)/10)                     
>>> %timeit -r 1 arr.sum()
1 loops, best of 1: 5min 37s per loop

The Python built-in sum() function on the array runs in about the same time as NumPy's sum(), so NumPy's summing is itself not the problem:

>>> arr = unp.uarray(np.arange(2000), np.arange(2000)/10)
>>> %timeit sum(arr)
1 loops, best of 3: 2.93 s per loop

Completely removing NumPy is as slow:

>>> arr_list = list(unp.uarray(np.arange(2000), np.arange(2000)/10))
>>> %timeit -r 1 sum(arr_list)
1 loops, best of 1: 3.16 s per loop

So, the problem lies entirely within uncertainties…

The resulting sum depends on 2000 random variables and uncertainties 2.4.6 is not handling this efficiently, apparently. I'm not yet sure why. This warrants some deeper investigation (profiling…).

andyfaff commented 9 years ago

Is there any progress on this issue? I've just started using uncertainties and there are several places which could do with a speedup.

lebigot commented 9 years ago

I have not looked at this yet (having two jobs takes a toll!). Now, this looks like an issue that's probably concentrated on a small part of the code, so I left myself a note for early July: I may have time then to investigate this. Anybody is welcome to investigate the issue too. :) Knowing what part of the code causes the exponential slow-down would help. I still consider the slow-down surprising, so I have hope that it can be solved.

andyfaff commented 9 years ago

I use the following code in an ipython cell after the necessary imports:

d = np.arange(2000) * 100
e = unp.uarray(d, np.sqrt(d))

I get the following output:

14054975 function calls in 4.046 seconds

Ordered by: internal time

ncalls tottime 1999 2.283 6002997 1.071 6002997 0.332 2002998 0.286 1 0.055 0.055 5998 0.004 1999 0.003 1999 0.002 1999 0.002 5997 0.002 9995 0.002 3998 0.001 3998 0.001 1999 0.001 1999 0.001 1999 0.000 1999 0.000 1 0.000 0.000 1 0.000 0.000 1 0.000 0.000 1 0.000 0.000 percall cumtime percall filename:lineno(function) 0.001 3.991 0.002 init.py:810(f_with_affine_output) 0.000 1.403 0.000 init.py:2814(hash) 0.000 0.332 0.000 {id} 0.000 0.286 0.000 init.py:950() 4.046 4.046 {method 'reduce' of 'numpy.ufunc' objects} 0.000 0.010 0.000 {isinstance} 0.000 0.006 0.000 abc.py:128(instancecheck) 0.000 0.002 0.000 init.py:1708(init) 0.000 0.002 0.000 _weakrefset.py:68(contains) 0.000 0.002 0.000 init.py:943() 0.000 0.002 0.000 {method 'iteritems' of 'dict' objects} 0.000 0.001 0.000 init.py:588(getitem) 0.000 0.001 0.000 init.py:1738(nominal_value) 0.000 0.001 0.000 {getattr} 0.000 0.001 0.000 init.py:835() 0.000 0.000 0.000 :1() 0.000 0.000 0.000 {method 'itervalues' of 'dict' objects} 4.046 4.046 :1() 4.046 4.046 fromnumeric.py:1631(sum) 4.046 4.046 _methods.py:31(_sum) 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

So most of the time is spent in f_with_affine_output.

andyfaff commented 9 years ago

I then installed line_profile and ran the following code (using the @profile decorator on the f_with_affine_output function).

import numpy as np
import uncertainties.unumpy as unp
d = np.arange(2000.)
e = unp.uarray(d, np.sqrt(d)) 
np.sum(e)

Using the command kernprof -l -v test.py. The lines 956-958 take the vast of the time:

https://gist.github.com/andyfaff/96ed26a32d9899e2ac5f

Those lines are a nested for loop.

andyfaff commented 9 years ago

Investigating further with line_profiler, in the previous post line 958 in __init__.py got hit 2000999 times for an array with size 2000. When I make the array 200 then line 958 gets hit 20099 times, a factor of 99.6 times less. I.e. the scaling is O(N**2).

lebigot commented 9 years ago

Wow, thanks for all these tests. I'll have a close look in the coming days.

lebigot commented 9 years ago

I had a look. What you observe is normal: every time one term is added, the program calculates the derivatives of the new sum with respect to all the variables summed so far (this is needed information), which explains the quadratic time.

I thought about how to optimize this. The following is a little technical, but it is useful as a reference:

When summing independent variables and adding a new term, none of derivatives of the already calculated sum changes. In an ideal world, they would be kept as is in memory, and only the derivative of the sum with respect to the new term would be registered. In a general sum of two correlated terms, only derivatives with respect to the independent variables that are common to the two terms need to be updated in order to get the derivatives of the sum.

The problem is that each time a term is added, the program ends up creating a new dependent variable, so all the previously known derivatives must be eventually copied from the already calculated sum (another independent variable) variable) to the new one. This is natural because in code like sum0 = x+y; sum1 = sum0 + z: the derivatives of sum1 with respect to the independent variables x, yand z are different from the derivatives of sum0, so the derivatives of sum0 and sum1 must be kept separate in memory. Python's summation model works like this sum0/sum1 code: terms are added one by one, so I can't see for now a way of having the code notice a big sum of independent terms, with no needed intermediate dependent variables (the partial sums).

Now, it would be great to find a mechanism, at least for sums, through which the derivatives of the terms are not copied when adding a new term: this would make the summation linear in time, for independent variables. The difficulty, is that in general, the derivatives should be copied (like in the example of sum0 and sum1). This mechanism would thus work like hard links: no copy is performed until one of the files must be updated.

This "no copy of derivative values" optimization would be quite specific to the sum function, since its partial derivatives are 1 (many existing derivatives would thus not have to be updated, as they are technically only multiplied by a factor of 1). That said, this is such an important case that it is worth investigating the issue.

I will think again about this in about a month. It does not look so easy, and there would probably be an overhead for all functions but the sum, but it may be worth it—if at all possible.

lebigot commented 9 years ago

PS: the idea below is not the simplest or most flexible one. See the next comments for a better idea (formal, lazy linear expansions of the calculated functions).

I think I know how summing could be sped up: by using the collections.ChainMap class in order to store the derivatives of the sum, based on the derivatives of each term with respect to the independent variables. This ChainMap would replace the current .derivatives mapping (if I remember correctly what I did), but only for sums. This would avoid copying the derivatives of the previous partial sum and merging them with the new variable each time a term is added, as is done currently. The summation would become linear in time instead of quadratic.

This idea is different from the idea of some kind of "hard links" to existing lists of derivatives (see my previous comment). In fact, the derivatives of a variable with respect to its independent variables is fixed, so there is no need for a "hard link" to these derivatives.

A difficulty is that ChainMap was introduced in Python 3.3. Can it be backported?

In about a month, I will:

Check that this idea can work.
Implement it and do a speed test.
See if it can be backported to Python < 3.3.

lebigot commented 9 years ago

Note to self:

Even multiplication could in principle be accelerated similarly, with a generalization of the ChainMap method: when multiplying to variables, the independent variables that are not shared by the factors could have their derivatives represented as pairs (second_factor, derivatives_of_the_first_factor)—and symmetrically for the other factor. The common independent variables would have a derivative which is calculated by the usual chain rule (uv'+u'v); the new derivatives (that are not simply a factor applied to the old ones) would be stored in a new mapping {variable: derivative} put in front of the ChainMap.

In order to keep most of the code as is, the new "factored mappings" (factor, derivatives) should behave like dictionaries, with the factor automatically applied to the derivative, when needed.

This method would prevent the derivatives of the factors to be duplicated, at the cost of a computing overhead each time that a derivative needs to be evaluated, so it is not clear that it would be always efficient.

In order to solve this problem, a more advanced mechanism could be applied: when the (factor, derivatives) has to calculate too many products between the factor and the derivatives, they could all be calculated and the (factor, derivatives) could be replaced by a newly created derivatives mapping (where the factor has been applied to the original derivatives).

This model can be generalized to functions of a single argument: the derivatives of f(g(x,…)) are simply the derivatives of g(…) multiplied by a factor (f'(g(…)). Factors could be combined on the fly: the derivatives represented by (factor1, (factor0, derivatives)) would simply be (factor1*factor0, derivatives), pretty much in the same was as the chain rule is currently applied. I am not completely clear yet on the implications of this idea, but I feel that it ultimately points to replacing the current creation of a new derivatives mapping for each operation (which slows down to quadratice time repeated operations like the sum) by a different structure that refers differently to the independent variables involved. To be studied.

lebigot commented 9 years ago

Note to self:

I kept thinking about an optimization for sums, that could also work in the general case. I came up with the following principles:

Instead of applying the chain rule for each operation, calculate it lazily. The result can be represented by a tree generated by: leaf = variable (or Multiply(1, variable)); tree -> Multiply(coefficient, tree); tree0, tree1,… -> Sum(tree0, tree1,…). The tree represents the same linear approximation of the function as in the current code (where this approximation is stored as a mapping {variable: coefficient,…}. The advantage of this approach, compared to the current one, is that existing expressions are reused (instead of building a new mapping {variable: coef,…} (linear approximation representation) for each operation.
When a numerical uncertainty is needed, the chain rule expression can be evaluated (mathematically, the formal expression of the linear expansion is expanded, and terms with the same variable are gathered together). The calculation can replace each sub-expression recursively with its result, so that variables that depend on these sub-expressions take less memory and are evaluated faster later. The final result is a linear combination coef0_variable0 + coef1_variable1 + … with all the variables being different; this is equivalent to the current representation of the linear approximation fo the function with a mapping. It could be represented as a tree too, for a uniform treatment by the code: Sum(Multiply(coef0, variable0), Multiply(coffee, variable1)). However, some mechanism must be found so that such an evaluated expression is not re-evaluated again if it is needed by another part of the program. Maybe a flag could indicate it is evaluated (i.e. that the tree is flat and that all the variables are different), which would be checked so as to efficiently generate derivatives or the total uncertainty when needed (these actions would typically prompt the evaluation and set the flag, after which the evaluation would not be performed again); another option would be to store evaluated and non-evaluated expressions differently (e.g., with a mapping and a tree, respectively), maybe with classes that have the same interface (this would remove the need for testing a flag or the type of the expression).

With this method, summing independent variables is linear in time (instead of quadratic). More generally, summing expressions that do not contain identical independent variables is linear too. Even more generally, the calculation time is essentially proportional to the number of variable occurrences (a variable can count for more than one) in the expression (after each term is evaluated).

Optimizations can be performed on expression/trees as needed. For example:

An expression with a single term can be multiplied without a memory increase: Multiply(coef1, Sum(Multiply(coef0, variable))) = Sum(Multiply(coef1*coef0, variable).
It may happen that some operations add many terms together that contain the same single (independent) variable: this single variable can end up occupying a lost of memory (one piece of memory in each of the expressions that contain it). A tree object could keep a count of all the individual (independent) variables it contains, and an evaluation of the subtrees that contain variables with many occurrences can be grouped together (or, more simply, the whole expression can be evaluated).

This approach is simpler than the ChainMap approach above, which insisted on always evaluating chain rule expressions, while leaving untouched sub-expressions and correcting them for the needed variables (variables that belong to more than one term). It does take more memory, though, but in practice many functions have arity one, so the first optimization above should be good.. Furthermore, the depth of the expression trees should be small (a high tree would be obtained with sin(sin(sin(sin(…)))), which is a rare case—which would be handled by the optimization above, if needed.

lebigot commented 9 years ago

Note to self: the "lazy evaluation of derivatives via formal expression" idea above does not solve the problem of the slowness of sum(). In fact, constructing a sum of numbers with uncertainties is fast (linear in time), but the natural way of calculating the final uncertainty is slow (quadratic, like before). This "natural" way consists in recursively calculating the derivatives of each intermediate sum (i.e. of …+…, then of …+…+…, etc.). This is natural because in general variables depend on each other, so it makes sense to memorize any intermediate result, as in:

x = … + …
y = x + …
z = y + …

This time complexity thus essentially comes from the fact that Python constructs …+…+…+…+… as pairwise sums ((…+…)+…)+…: each sum then "wants" to calculate its derivatives with respect to the variables involved, which yields a quadratic run time.

Now, this is not good for summing many numbers. What we want in this case is a gathering of all the terms in ((…+…)+…)+…, without calculating the derivatives of the intermediate results. This can be done in linear time. However, finding a method that also satisfies the requirements of:

not copying any of the intermediate sum derivatives—as is currently done, and as would be done in a natural recursive calculation of the derivatives of the full expression—, so as to keep the time linear;
still keeping in memory the derivatives of the intermediate sums (as this is generally desirable, as intermediate variables are generally of interest).

One solution, at least for sums, might be to blend the two ideas above:

lazy evaluation of derivatives through formal expressions,
copy-less expand-and-collect operation when simplifying the formal expressions and collecting the derivatives with respect to each variable (maybe with ChainMaps).

In effect, the speed up would be based on the calculation of formal expressions, and on their efficient expand-and-collect operation (which collects the numerical coefficient in front of each independent variable in the linear expansion of the calculated function) for some formulas, in particular the sum of many independent variables, where this can be done linearly in time.

This architecture might be general enough to allow for more efficient expand-and-collect strategies to be added (maybe the sum is not a case which is too particular).

lebigot commented 9 years ago

PS: ChainMaps actually do not solve the quadratic time problem, because the lookup of shared variables between terms would become linear, for each additional term (formal coefficients stored as {x:3, y:4}, {z: 5}, {t: 6},…, with each new term requiring a linear lookup so as to combine the new term with any existing term with the same variable). The current situation is that the collection of the terms of a long formal sum is quadratic (with the formal sum being calculated linearly in time). Maybe it is possible to introduce a new class that would make the collection linear in time while preserving the generally good idea of doing the collection for intermediate variables?

Additional challenge: have both sum() and cumsum() run fast (linear time).

lebigot commented 9 years ago

Note to self:

Goals:

Calculation of the uncertainty of a single large sum in linear time.
Calculation of the uncertainties of a cumulative sum in linear time (this is a common application).

Difficulties:

A large sum is typically expressed in Python as many binary additions. The quadratic summation time in uncertainties 2.4.6.1 comes from the linear number of term at each step of the summation.

These requirements seem to essentially imply the following approach (which differs from the ideas above, which do not fully work):

A lazy calculation of summations, where each step of the summation is done in constant time (instead of linear time), despite the linearly growing number of terms.
The storage of information related to the terms of a large sum in a single variable (that grows linearly), instead of in a new copy of the previous terms with an additional new term for each binary addition.
Intermediate summation results should essentially be a view on (one of) the larger sums in which they are embedded. Concretely, the terms of the total sum would typically be represented by a list or array, and the terms of the intermediate sums would be represented by indexes in this array. Thus, intermediate sums do not create linearly increasing data structures (which would yield a quadratic execution time). When summing two (formal) sums that have some random variables in common, the linear coefficient associated to these variables (by the linear theory of error propagation) could be stored in a separate structure that contains "corrections" to the main, unique list or array that contains the result of larger sums (the goal is to have a fast method for summing mostly uncorrelated values, and a method which is quadratic when summing strongly correlated terms—i.e. terms with many random variables in common).

To be done:

Define a concrete algorithm which runs linearly in time (at least for uncorrelated terms) and that implements these ideas. Make sure that it meshes well with:
- Other, non-summation operations.
- Cumulative sums.
- The case of terms with many random variables in common (the speed should be quadratic at worst).

sashabaranov commented 8 years ago

I've also got an issue with slow summing. Solution for me was to write much simpler implementation of ufloat with limited functionality and use it instead: https://gist.github.com/sashabaranov/9c32849d1a9d89a81a8a

Maybe we can do something like this in uncertainties library? C-implementation can also speed things up.

lebigot commented 8 years ago

@sashabaranov Your implementation is restricted to independent variables, which indeed speeds things up because it does not have to keep track of interdependent variables. The goal of the uncertainties package is to instead give exact (first order approximation) uncertainty calculations (your module fails to give a 0 uncertainty for x-x, for instance). As for using C, this would not help with the current O(n^2) complexity, which is due to memory copies.

lebigot commented 8 years ago

Note to self: when combining numbers with uncertainty, maybe a copy could be avoided by checking whether the numbers have any reference to them (sys.getrefcount(), gc.getreferents()) and only updating one of the numbers instead of copying it completely as is done currently? This would solve the O(n^2) complexity issue that slows down large summations.

lebigot commented 8 years ago

Note to self: the idea of the lazy evaluation of derivatives might work if the final result is the only thing which is stored in memory (instead of keeping in memory the derivatives of each intermediate term, as is currently done). Now, a disadvantage is that the resulting absence of "caching" will slow down situations like x = 1±1, y = f(x), z = g(y), if getting the uncertainty on z before the uncertainty on y (because the calculation of derivatives for z will not cache derivatives for y). Maybe the more technical suggestion above of counting references to quantities would help (intermediate results in x = a+b+c+… would not have the same reference count as x—but this is to be tested, as the devil is in the details).

lebigot commented 8 years ago

Some news: I have started implementing a linear-time version of the idea above (https://github.com/lebigot/uncertainties/tree/master_faster_uncert_calc). Summing 5,000 independent variables went from 17 s to 4 s (MacBook early 2015) and the algorithm should be linear instead of quadratic (I have not yet tested this). But this implementation is recursive (basically evaluating the uncertainty on ((a+b)+c)+d+… based on the uncertainty on (a+b)+c, etc.), and this hits Python's recursion limit, which cannot be increased ad infinitum (increasing it to 50,000 and summing a few tens of thousands of terms crashes). However, I now have a non-recursive implementation in mind. I will implement it and see.

lebigot commented 8 years ago

Note: I turned to SymPy for ideas and summed 5,000 formal terms (with no calculation done whatsoever). After 30 minutes, the calculation had not returned. uncertainties is comparatively actually fast, as it even does calculations, when summing. :D

lebigot commented 8 years ago

Some great news:

Progress report on this issue, based on (incomplete) version 84a27f5 on my laptop: summing 5,000 terms takes only 0.070 s with the new version (down from 17 s with the published version: a 250x speedup). Summing 20x more terms (100,000 terms) takes 20x times more time (1.4 s, down from 170 min, a 7,000x speedup), and summing an additional 10x more terms (1,000,000 terms) takes 10x more time (15 s). Thus, the new calculation is indeed linear in time (instead of quadratic), and absolute calculation times are much, much shorter. Very nice.

Now, more work is needed until all the tests pass: the sum() works, but other functions and features are currently broken because I still need to update other parts of the code (despite the fact that I did not code anything specific for sum(), but only changed the general uncertainty calculation implementation).

The new implementation is in branch https://github.com/lebigot/uncertainties/tree/master_faster_uncert_calc, which will probably be merged into the main branch (master) at some point, since I don't foresee any blocking point.

The remaining difficulties seem to involve mostly NumPy support (49 in 54 unit tests pass).

lebigot commented 8 years ago

The master branch now has the (much) faster implementation. All tests pass.

lebigot commented 8 years ago

The latest release now includes the much faster uncertainty calculation.

PyPI release: https://pypi.python.org/pypi?:action=display&name=uncertainties&version=3.0.1 GitHub release: https://github.com/lebigot/uncertainties/tree/3.0.1

andyfaff commented 8 years ago

Thank you Eric.

On 16 August 2016 at 07:00, Eric O. LEBIGOT (EOL) notifications@github.com wrote:

Closed #30 https://github.com/lebigot/uncertainties/issues/30.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lebigot/uncertainties/issues/30#event-756204558, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq51pPyUZmbhCDmY3mPFIsJaQY7j0Jdks5qgNOGgaJpZM4CDu0_ .

Dr. Andrew Nelson

lebigot commented 8 years ago

Thank you @andyfaff for your input on this subject, a year ago: this pushed me to think more about the problem and, ultimately, to find a solution. :)

nmearl commented 8 years ago

Hooray! Great job.

lmfit / uncertainties

Summing unusably slow #30