Improved count matrix performance

franknoe commented 8 years ago

This PR mainly removes inefficiencies of transition matrix counting encountered in cases of many states or many trajectories. Additionally, some PEP coding violations were fixed and deprecated comments and TODO statements were removed.

In some cases (significant more than 1000 states or more than 10000 trajectories), transition matrix counting could have been really slow. msmtools.estimation.count_matrix is now using the new counting method msmtools.estimation.sparse.count_matrix.count_matrix_coo2_mult. To effect on performance is illustrated by following test code:

from msmtools.estimation.sparse.count_matrix import count_matrix_bincount_mult
from msmtools.estimation.sparse.count_matrix import count_matrix_coo_mult
from msmtools.estimation.sparse.count_matrix import count_matrix_coo2_mult

def count_performance(dtrajs, lag=1):
    import time
    nstates = msmtools.estimation.number_of_states(dtrajs)
    t1 = time.time()
    count_matrix_bincount_mult(dtrajs, lag)
    t2 = time.time()
    count_matrix_coo_mult(dtrajs, lag)
    t3 = time.time()
    count_matrix_coo2_mult(dtrajs, lag)
    t4 = time.time()
    print len(dtrajs), '\t', dtrajs[0].size, '\t', nstates, '\t', int(1000*(t2-t1)), '\t', int(1000*(t3-t2)), '\t', int(1000*(t4-t3))

print '[1]: count_matrix_bincount_mult'
print '[2]: count_matrix_coo_mult'
print '[3]: count_matrix_coo2_mult'
print
print 'ntraj\tlength\tnstates\t[1]\t[2]\t[3]'
print '----------------------------------------------'
# long trajectory
count_performance([np.random.choice(10, size=10000)])
count_performance([np.random.choice(10, size=100000)])
count_performance([np.random.choice(10, size=900000)])
count_performance([np.random.choice(100, size=10000)])
count_performance([np.random.choice(100, size=100000)])
count_performance([np.random.choice(100, size=900000)])
count_performance([np.random.choice(1000, size=10000)])
count_performance([np.random.choice(1000, size=100000)])
count_performance([np.random.choice(1000, size=900000)])
count_performance([np.random.choice(10000, size=10000)])
count_performance([np.random.choice(10000, size=100000)])
count_performance([np.random.choice(10000, size=900000)])
# many short trajectories
count_performance([np.random.choice(10, size=2) for i in range(10000)])
count_performance([np.random.choice(100, size=2) for i in range(10000)])
count_performance([np.random.choice(1000, size=2) for i in range(10000)])
# many short trajectories II
count_performance([np.random.choice(10, size=20) for i in range(10000)])
count_performance([np.random.choice(100, size=20) for i in range(10000)])
count_performance([np.random.choice(1000, size=20) for i in range(10000)])

Results (time in milliseconds) on my machine (MacOS 10.10.1, Intel i7@1.7 GHz, 8 GB Ram):

[1]: count_matrix_bincount_mult
[2]: count_matrix_coo_mult
[3]: count_matrix_coo2_mult

ntraj   length  nstates [1]     [2]     [3]
----------------------------------------------
1       10000   10      0       1       1
1       100000  10      2       11      9
1       900000  10      18      86      85
1       10000   100     0       1       0
1       100000  100     2       14      13
1       900000  100     16      85      89
1       10000   1000    32      1       1
1       100000  1000    36      11      10
1       900000  1000    116     131     136
1       10000   9997    3444    5       2
1       100000  10000   2926    6       8
1       900000  10000   3061    102     96
10000   2       10      139     3772    84
10000   2       100     403     4167    67
10000   2       1000    63678   4350    83
10000   20      10      148     4029    105
10000   20      100     443     4892    83
10000   20      1000    50849   10013   109

It is seen that while the new method does not always win, it never leads to an explosion of the runtime while the two previous methods could become really useless with big data. If there are no objections, I will remove the old counting methods in a subsequent PR.

coveralls commented 8 years ago

Coverage increased (+0.2%) to 81.619% when pulling 18eae05b20daffde772a423b59ee60bc925593f4 on count_matrix_performance into b5ac0f6682fb70c5fc83f595e3402c2cbef250ec on devel.

marscher commented 8 years ago

Nice improvement! I would also remove the obsolete code (maybe already in this PR)

franknoe commented 8 years ago

I think it's good to have one version on which this test code in the description of this PR works. Once I remove the deprecated methods, this won't work anymore.

marscher commented 8 years ago

You're right.

trendelkampschroer commented 8 years ago

Nice!

markovmodel / msmtools

Improved count matrix performance #83