Closed tscrim closed 7 years ago
Branch pushed to git repo; I updated commit sha1. New commits:
7373869 | Speedup and improvement to dictionary addition and related methods. |
Branch pushed to git repo; I updated commit sha1. New commits:
f082104 | Reviewer changes by Nicolas. |
Hmmm... slightly worrisome that there is an ordering change to elements in that failing `categories/finite_dimensional_algebras_with_basis.py` test, which this ticket should not have caused, AFAICS. Will investigate.
So the issue is that the example finite monoid does not have an ordering on its elements. What I am thinking of doing is implementing a `_cmp_` by using the `_cmp_by_value` of `ElementWrapper`. This will make this doctest far less fragile in the future (it's somewhat surprising that it is not machine-dependent as is), but it then becomes less of a minimal implementation.
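To illustrate the idea, here is a minimal, hypothetical sketch (plain Python, not Sage's actual `ElementWrapper`; the class name `Wrapper` is made up) of comparing wrapped elements by their underlying value, which is what makes the doctest order deterministic and machine-independent:

```python
import functools

@functools.total_ordering
class Wrapper(object):
    """Toy stand-in for an element-wrapper class."""
    def __init__(self, value):
        self.value = value  # the wrapped payload, assumed comparable

    def __eq__(self, other):
        return isinstance(other, Wrapper) and self.value == other.value

    def __lt__(self, other):
        # order by the wrapped value, not by memory address or hash,
        # so sorted() gives the same order on every machine
        return self.value < other.value

elements = [Wrapper(v) for v in ("ba", "ab", "aa")]
print([w.value for w in sorted(elements)])  # ['aa', 'ab', 'ba']
```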
Branch pushed to git repo; I updated commit sha1. New commits:
d2a704c | 20680: define a total order on Monoids().Finite().example() for more deterministic tests |
Branch pushed to git repo; I updated commit sha1. New commits:
4e912ff | Merge branch 'develop' into t/20680/public/combinat/improve_dict_addition-20680 |
Is there anything combinatorics-specific to `dict_addition.pyx`? If not, it should not be in the `combinat` directory. Perhaps move to `data_structures`?
Work Issues: Move from combinat
Here are some benchmarks: first with plain Sage 5.3, then with Travis's optimizations, and then after the code refactoring of commit ee10e70daec5704b4f2c8f63c2cf2e3c525e9311. Sorry it's a bit long; the timings fluctuate quite a bit, which makes this painful to do.
Rough interpretation: with Travis's optimizations the speed gain fluctuates between 1x and 3x, and the refactoring does no harm; it could even improve the situation slightly.
For each n = 0, 1000, 2000, 4000, we time the addition of two dictionaries of length n: three times for fully overlapping dictionaries, three times for half-overlapping dictionaries, and three times for non-overlapping dictionaries.
def Xn(n): return {i: i for i in range(n)}
def Yn(n): return {i: i for i in range(n, 2*n)}
def Zn(n): return {i: i for i in range(2*n, 3*n)}
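For reference, here is a pure-Python sketch of the operation being benchmarked; the actual `dict_addition` is compiled Cython, so this is only a semantic model, not the measured code:

```python
def dict_add(*dicts):
    """Merge the input dictionaries, summing the values of colliding keys."""
    result = {}
    for D in dicts:
        for key, value in D.items():
            if key in result:
                result[key] += value   # overlapping key: accumulate
            else:
                result[key] = value    # new key: copy the value over
    return result

# Small versions of the three benchmark regimes:
X = {i: i for i in range(3)}       # fully overlapping with itself
Z = {i: i for i in range(6, 9)}    # disjoint from X
print(dict_add(X, X))  # {0: 0, 1: 2, 2: 4}
print(dict_add(X, Z))  # union of the two dicts, no collisions
```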
Setup:
sage: blasadd = sage.combinat.dict_addition.dict_addition
Running the tests for a given n:
X = Xn(n); Y = Yn(n); Z = Zn(n)
%timeit blasadd([X,X])
%timeit blasadd([X,X])
%timeit blasadd([X,X])
%timeit blasadd([X,Y])
%timeit blasadd([X,Y])
%timeit blasadd([X,Y])
%timeit blasadd([X,Z])
%timeit blasadd([X,Z])
%timeit blasadd([X,Z])
Timings:
n = 0
1000000 loops, best of 3: 1.12 µs per loop
1000000 loops, best of 3: 662 ns per loop
1000000 loops, best of 3: 684 ns per loop
1000000 loops, best of 3: 646 ns per loop
1000000 loops, best of 3: 1.04 µs per loop
1000000 loops, best of 3: 941 ns per loop
1000000 loops, best of 3: 918 ns per loop
1000000 loops, best of 3: 705 ns per loop
1000000 loops, best of 3: 670 ns per loop
n = 1000
1000 loops, best of 3: 365 µs per loop
1000 loops, best of 3: 405 µs per loop
1000 loops, best of 3: 402 µs per loop
1000 loops, best of 3: 462 µs per loop
1000 loops, best of 3: 212 µs per loop
1000 loops, best of 3: 376 µs per loop
1000 loops, best of 3: 465 µs per loop
1000 loops, best of 3: 481 µs per loop
1000 loops, best of 3: 467 µs per loop
n = 2000
1000 loops, best of 3: 835 µs per loop
1000 loops, best of 3: 369 µs per loop
1000 loops, best of 3: 785 µs per loop
1000 loops, best of 3: 504 µs per loop
1000 loops, best of 3: 968 µs per loop
1000 loops, best of 3: 561 µs per loop
1000 loops, best of 3: 618 µs per loop
1000 loops, best of 3: 675 µs per loop
1000 loops, best of 3: 1.02 ms per loop
n = 4000
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.19 ms per loop
1000 loops, best of 3: 1.17 ms per loop
1000 loops, best of 3: 1.82 ms per loop
1000 loops, best of 3: 1.67 ms per loop
1000 loops, best of 3: 1.5 ms per loop
100 loops, best of 3: 2.02 ms per loop
100 loops, best of 3: 1.1 ms per loop
1000 loops, best of 3: 1.53 ms per loop
Setup:
blasadd = sage.combinat.dict_addition.dict_add
Running the tests for a given n:
X = Xn(n); Y = Yn(n); Z = Zn(n)
%timeit blasadd(X,X)
%timeit blasadd(X,X)
%timeit blasadd(X,X)
%timeit blasadd(X,Y)
%timeit blasadd(X,Y)
%timeit blasadd(X,Y)
%timeit blasadd(X,Z)
%timeit blasadd(X,Z)
%timeit blasadd(X,Z)
Timings:
n = 0:
1000000 loops, best of 3: 677 ns per loop
1000000 loops, best of 3: 574 ns per loop
1000000 loops, best of 3: 1.01 µs per loop
1000000 loops, best of 3: 985 ns per loop
1000000 loops, best of 3: 640 ns per loop
1000000 loops, best of 3: 635 ns per loop
1000000 loops, best of 3: 664 ns per loop
1000000 loops, best of 3: 489 ns per loop
1000000 loops, best of 3: 600 ns per loop
n = 1000: ~.3ms
1000 loops, best of 3: 357 µs per loop
1000 loops, best of 3: 353 µs per loop
1000 loops, best of 3: 373 µs per loop
1000 loops, best of 3: 282 µs per loop
1000 loops, best of 3: 283 µs per loop
1000 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 302 µs per loop
1000 loops, best of 3: 315 µs per loop
1000 loops, best of 3: 313 µs per loop
n = 2000: ~.7 ms
1000 loops, best of 3: 787 µs per loop
1000 loops, best of 3: 778 µs per loop
1000 loops, best of 3: 747 µs per loop
1000 loops, best of 3: 175 µs per loop
1000 loops, best of 3: 343 µs per loop
1000 loops, best of 3: 256 µs per loop
1000 loops, best of 3: 682 µs per loop
1000 loops, best of 3: 330 µs per loop
1000 loops, best of 3: 662 µs per loop
n = 4000: ~1.2 ms
1000 loops, best of 3: 1.15 ms per loop
1000 loops, best of 3: 1.46 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 800 µs per loop
1000 loops, best of 3: 985 µs per loop
1000 loops, best of 3: 915 µs per loop
1000 loops, best of 3: 1.08 ms per loop
Setup:
blasadd = sage.data_structures.blas_dict.add
Running the tests for a given n:
X = Xn(n); Y = Yn(n); Z = Zn(n)
%timeit blasadd(X,X)
%timeit blasadd(X,X)
%timeit blasadd(X,X)
%timeit blasadd(X,Y)
%timeit blasadd(X,Y)
%timeit blasadd(X,Y)
%timeit blasadd(X,Z)
%timeit blasadd(X,Z)
%timeit blasadd(X,Z)
Timings:
n = 0: ~0.2–0.6 µs
1000000 loops, best of 3: 578 ns per loop
1000000 loops, best of 3: 603 ns per loop
1000000 loops, best of 3: 227 ns per loop
1000000 loops, best of 3: 226 ns per loop
1000000 loops, best of 3: 391 ns per loop
1000000 loops, best of 3: 572 ns per loop
1000000 loops, best of 3: 566 ns per loop
10000000 loops, best of 3: 515 ns per loop
1000000 loops, best of 3: 217 ns per loop
n = 1000: ~.3 ms
10000 loops, best of 3: 299 µs per loop
1000 loops, best of 3: 351 µs per loop
1000 loops, best of 3: 334 µs per loop
1000 loops, best of 3: 256 µs per loop
1000 loops, best of 3: 255 µs per loop
1000 loops, best of 3: 258 µs per loop
1000 loops, best of 3: 78.1 µs per loop
1000 loops, best of 3: 279 µs per loop
1000 loops, best of 3: 283 µs per loop
n = 2000: ~.6 ms
1000 loops, best of 3: 704 µs per loop
1000 loops, best of 3: 709 µs per loop
1000 loops, best of 3: 697 µs per loop
1000 loops, best of 3: 174 µs per loop
1000 loops, best of 3: 334 µs per loop
10000 loops, best of 3: 502 µs per loop
1000 loops, best of 3: 204 µs per loop
1000 loops, best of 3: 612 µs per loop
1000 loops, best of 3: 591 µs per loop
n = 4000: ~1 ms
1000 loops, best of 3: 1 ms per loop
1000 loops, best of 3: 960 µs per loop
1000 loops, best of 3: 903 µs per loop
1000 loops, best of 3: 658 µs per loop
1000 loops, best of 3: 1.04 ms per loop
1000 loops, best of 3: 723 µs per loop
1000 loops, best of 3: 1.15 ms per loop
1000 loops, best of 3: 1.16 ms per loop
1000 loops, best of 3: 984 µs per loop
Branch pushed to git repo; I updated commit sha1. New commits:
ee10e70 | 20680: refactored the internal dict_* methods of (Combinatorial)FreeModule with a BLAS-style API |
18dcef3 | Merge branch 'develop' into t/20680/public/combinat/improve_dict_addition-20680 |
bff94d1 | 20680: typo fix |
300b5e7 | 20680: micro bug fix (wrong argument order) |
Replying to @jdemeyer:

> Is there anything combinatorics-specific to `dict_addition.pyx`? If not, it should not be in the `combinat` directory. Perhaps move to `data_structures`?
I agree. I'll chat with Travis about whether we do it now or in a later ticket.
After discussion with Travis: we are hesitating between `sage.modules` and `sage.data_structures`; Travis is more in favor of the latter. Any opinion?
I'll fix some remaining doctest failures in the meantime.
I also mentioned earlier that `data_structures` seems like a better choice.
Replying to @jdemeyer:

> I also mentioned earlier that `data_structures` seems like a better choice.

Your comment was about `data_structures` versus `combinat`, when we are hesitating between `modules` and `data_structures` :-)
I'll do the move this afternoon.
Branch pushed to git repo; I updated commit sha1. New commits:
be78e1f | 20680: refactored the logic of dict_addition.iaxpy for speed (hopefully) and compactness |
Branch pushed to git repo; I updated commit sha1. New commits:
7749836 | 20680: updated the deprecation aliases w.r.t. the previous move and documented that deprecation |
Branch pushed to git repo; I updated commit sha1. New commits:
07b303e | 20680: UTF-8 fix + note about Python3 compatibility |
Obvious design question: why don't you implement this as a subclass of `dict`, such that you could actually write `a * D` instead of `scal(a, D)`?
I understand that it might be more work this way, but it does seem the most natural thing to do, and it would result in more readable and more efficient code.
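A minimal sketch of this suggestion, for the sake of discussion; the class name `BlasDict` is made up here, and the real coefficients would be ring elements rather than ints:

```python
class BlasDict(dict):
    """Toy dict subclass on which linear-algebra operations are operators."""

    def __add__(self, other):
        # out-of-place addition: merge, summing values of colliding keys
        result = BlasDict(self)
        for key, value in other.items():
            result[key] = result.get(key, 0) + value
        return result

    def __rmul__(self, a):
        # handles a * D (scalar on the left); a symmetric __mul__
        # would cover D * a
        return BlasDict((key, a * value) for key, value in self.items())

D = BlasDict({'x': 1, 'y': 2})
E = BlasDict({'y': 3})
print(2 * D + E)  # {'x': 2, 'y': 7}
```

Note that, as discussed below in the thread, `2 * D + E` here builds an intermediate dict for `2 * D`, whereas a fused `axpy`-style call can do it in one pass.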
This should be avoided:
`__cmp__ = ElementWrapper._cmp_by_value`
as it will not be supported in Python 3.
Cython knows how to copy dicts efficiently, so you can replace `PyDict_Copy(D)` by `D.copy()` (assuming that `D` is declared as `dict`).
I think you should also specify exactly what mathematical assumptions you make on `K`. For example, you assume that `bool(x)` implies `bool(-x)`, and you assume that `-1 * x == -x`.
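These two assumptions can be stated as a quick sanity check one could run on sample coefficients; this is only an illustration of the assumptions being asked about, not code from the branch:

```python
from fractions import Fraction

def satisfies_assumptions(x):
    """Check the two coefficient assumptions on a single sample element x."""
    # a nonzero element must have a nonzero negative
    # (otherwise zero-stripping after negation would drop wrong terms)
    if bool(x) and not bool(-x):
        return False
    # scaling by -1 must agree with negation
    return -1 * x == -x

# holds for ordinary exact coefficients:
assert satisfies_assumptions(Fraction(3, 7))
assert satisfies_assumptions(-2)
```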
This is false:

    .. TODO::

        Upon migrating to Python 3, change .iteritems below to .items. We
        don't want to do it now as this is a speed-critical location.

Cython supports `.iteritems()` for objects typed as `dict`, so you can just keep `.iteritems()` regardless of the Python version.
I think the `remove_zeros` flag of `iaxpy` is too confusing. You don't really define what happens if `remove_zeros=False` (which keys would appear with a zero value?). I guess that every key which appears in `X` or `Y` should appear in the result, but that is not currently the case.
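For concreteness, here is a hedged pure-Python sketch of a BLAS-style in-place "axpy" on dicts, `Y <- a*X + Y`, with a `remove_zeros` flag; the semantics shown (strip all zero values at the end when the flag is set) follow my reading of the discussion, not necessarily the final Cython code:

```python
def iaxpy(a, X, Y, remove_zeros=True):
    """In-place Y <- a*X + Y on dicts keyed by basis indices."""
    for key, x in X.items():
        Y[key] = Y.get(key, 0) + a * x
    if remove_zeros:
        # drop every key whose value is now zero, so the result has
        # strict support like a CombinatorialFreeModule element
        for key in [k for k, v in Y.items() if not v]:
            del Y[key]
    return Y

Y = {1: 5, 2: -6}
iaxpy(2, {1: 1, 2: 3}, Y)   # key 2: 2*3 + (-6) = 0, so it is stripped
print(Y)  # {1: 7}
```

With `remove_zeros=False` the zero at key 2 would be kept, which is the behavior whose exact specification is being questioned here.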
Branch pushed to git repo; I updated commit sha1. New commits:
9570c65 | Doing some changes asked for by Jeroen and some other little doc tweaks. |
comment:22 - This would be quite difficult to do because we would have to handle `a * D` and `D * a`, which would take more time because they would both pass through `_mul_`. Furthermore, we couldn't do `a*D + E` in one single operation.
comment:23 - I think this was needed at the time due to the absence of coercion for comparisons for `ElementWrapper`.
comment:24 - Done.
comment:25 - I took a crack at it. Let me know if you still think it is unclear.
Changed work issues from Move from combinat to none
Replying to @tscrim:

> comment:22 - This would be quite difficult to do because we would have to handle `a * D` and `D * a`, which would take more time because they would both pass through `_mul_`.

More precisely, `_mul_` would not be involved, since `D` would not be an `Element`. But the coercion model would be involved. If the coercion model is too slow for these purposes, we really should fix that. It makes no sense to make code more complicated just to avoid the coercion model.
> Furthermore, we couldn't do `a*D + E` in one single operation.

True, but then you make a method to do that in a single operation.
Regarding [comment:27], I think the phrase "values are zero after the addition has been performed" is still not clear. Do you mean values such that `a * x` is non-zero and `y` is non-zero but `a * x + y` is zero? If so, what is the rationale for keeping those zeros but not the cases where `y` is zero and `a * x` is zero?
Replying to @tscrim:

> comment:22 - This would be quite difficult to do because we would have to handle `a * D` and `D * a`, which would take more time because they would both pass through `_mul_`. Furthermore, we couldn't do `a*D + E` in one single operation.

OK, new try: how about subclassing `dict` and making all operations methods of that class? That would avoid the inefficiency issues with the coercion model.
(Edit: wrong ticket)
Replying to @jdemeyer:

> Replying to @tscrim:
>
> > comment:22 - This would be quite difficult to do because we would have to handle `a * D` and `D * a`, which would take more time because they would both pass through `_mul_`.
>
> More precisely, `_mul_` would not be involved, since `D` would not be an `Element`. But the coercion model would be involved. If the coercion model is too slow for these purposes, we really should fix that. It makes no sense to make code more complicated just to avoid the coercion model.
Sorry, that should have been `__mul__`, but you still need to spend some cycles differentiating between `a * D` and `D * a` in Cython. At least, I could not get `__radd__` to work on a Cython class (different ticket), and I'm assuming that extends to `__rmul__`.
Replying to @jdemeyer:

> OK, new try: how about subclassing `dict` and making all operations methods of that class? That would avoid the inefficiency issues with the coercion model.
If we make all operations methods of that class, then all I see is unneeded complexity, because we'd still be making (essentially) function calls everywhere, but we would also have to differentiate between `dict` and `BLASdict`.
Replying to @jdemeyer:

> Regarding [comment:27], I think the phrase "values are zero after the addition has been performed" is still not clear. Do you mean values such that `a * x` is non-zero and `y` is non-zero but `a * x + y` is zero? If so, what is the rationale for keeping those zeros but not the cases where `y` is zero and `a * x` is zero?
I think you're misparsing something. The word "value" is as in the key-value pairs of a dictionary (because everything we are doing here is on dictionaries). In some cases, someone might want a basis element that has a coefficient of 0. For instance, it takes longer to remove these coefficients, and if you're doing a lot of additions with full support, you may only want to remove the basis elements with a 0 coefficient after you're all done.
Actually, a slightly radical proposal: how about splitting this off altogether as a separate spkg, since it is independent of Sage?
Replying to @tscrim:

> Actually, a slightly radical proposal: how about splitting this off altogether as a separate spkg, since it is independent of Sage?

Even if you do that, you could still develop it within Sage and then split it off.
Replying to @tscrim:

> Replying to @jdemeyer:
>
> > Regarding [comment:27], I think the phrase "values are zero after the addition has been performed" is still not clear. Do you mean values such that `a * x` is non-zero and `y` is non-zero but `a * x + y` is zero? If so, what is the rationale for keeping those zeros but not the cases where `y` is zero and `a * x` is zero?
>
> I think you're misparsing something.

If I am misparsing something, it probably means that the documentation wasn't clear. My question remains: if `remove_zeros=False`, exactly which keys will appear with a zero value?
We improve the speed of methods like `dict_addition` in order to improve the speed of addition in `CombinatorialFreeModule`.

CC: @sagetrac-sage-combinat @nthiery
Component: performance
Keywords: combinatorial free module, addition, days79
Author: Travis Scrimshaw, Nicolas M. Thiéry
Branch/Commit: 95aacfa
Reviewer: Nicolas M. Thiéry, Jeroen Demeyer
Issue created by migration from https://trac.sagemath.org/ticket/20680