Covar - Githubissues

franknoe commented 9 years ago

This PR implements a first attempt for moments calculation on data (sums and covariances) that exploit sparsity and constant features in the data.

variational.estimators.moments contains functions to compute sums and covariances on a data matrix X or on two data matrices X and Y. Enables symmetric computation, substraction of mean and exploits sparsity. Different numerical input formats (float64, float32, int, bool) are tolerated, but internally we will work with float64 or in rare cases with float32 in order to be both numerically accurate and fast (BLAS).
variational.estimators.moments also contains a pair-wise single-sweep merge class for sums and covariances computed on data chunks.
variational.estimators.tests.benchmark_moments has performance tests. Just run the file and see.

Ready for review and merge, but still needs some improvements to be ready for operation.

@fnueske and @fabian-paul : please review

franknoe commented 9 years ago

@fnueske can you review this and merge if OK?

marscher commented 9 years ago

Some comments:

Except for the "performance" test, there is no single unit test checking for the right results.
versioneer is being used, but not installed properly in the repo (use versioneer-installer).
a lot of debug statements are still present in the code (but commented out).

franknoe commented 9 years ago

Note: this package is not being released (not even as 0.something), it's clear that this doesn't have any acceptable standards yet. Feliks, Fabian and I are just collecting and discussing code at this stage.

Am 18/11/15 um 12:53 schrieb Martin K. Scherer:

Some comments:

Except for the "performance" test, there is no single unit test checking for the right results.

I have unit tests in my current working copy but need these changes to be merged

versioneer is being used, but not installed properly in the repo (use versioneer-installer).

I don't know how to do this.

a lot of debug statements are still present in the code (but commented out).

Yes. See above.

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/variational/pull/13#issuecomment-157688925.

Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

fnueske commented 9 years ago

Yes, I'll have a look in the afternoon today.

Am 18.11.15 um 12:35 schrieb Frank Noe:

@fnueske https://github.com/fnueske can you review this and merge if OK?

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/variational/pull/13#issuecomment-157685143.

franknoe commented 9 years ago

I've added a Travis CI service. @marscher can you have a look if the config is OK? I am not sure what I should insert as "Token" and "Domain", so I have simply left this empty.

franknoe commented 9 years ago

BTW, there are unit tests in here: test_moments.py (last file of the commit)

marscher commented 9 years ago

Didn't spot that. To make travis work, one needs a travis.yaml file in the repo, which describes how to build and test the software.

franknoe commented 9 years ago

OK, I'll try to copy and adapt the PyEMMA one.

Am 18/11/15 um 14:08 schrieb Martin K. Scherer:

Didn't spot that. To make travis work, one needs a travis.yaml file in the repo, which describes how to build and test the software.

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/variational/pull/13#issuecomment-157706974.

Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

fnueske commented 9 years ago

Hey Frank,

going through everything I can evaluate, I have just some small questions left:

When trying to understand the update formula for the weights in variational.estimators.moments.MomentsStorage.Store, I get lost. Can you explain this formula to me at some point? But I guess it has been used many times before?
What is the role of rtol in variational.estimators.moments.MomentsStorage? I don't really understand it.
I can't really say anything about the C-code. Maybe someone else should have a look at it.
Is there a way to run the tests directly on github? Or do I have to download the new branch first?

Just for curiosity:

In variational.estimators.moments.momentsXX: why do we switch to float32 if the number of time steps is smaller than 2**23 and to float64 otherwise? Where does this number come from?
Why can we not use the fast subtract_row routine if the array is sparse but not a copy of the original array? What is the problem in this case?

franknoe commented 9 years ago

Am 18/11/15 um 17:16 schrieb Feliks Nüske:

Hey Frank,

going through everything I can evaluate, I have just some small questions left:

Sure, I'll take the opportunity to add comments to the code - but I'll do this in the next PR, because I have already changed things since then.

When trying to understand the update formula for the weights in variational.estimators.moments.MomentsStorage.Store, I get lost. Can you explain this formula to me at some point? But I guess it has been used many times before?

Its in this paper: http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

What is the role of rtol in variational.estimators.moments.MomentsStorage? I don't really understand it.

It's to decide when to merge two Moments. Ideally I'd like to merge two Moments when they have equal weights (i.e. equally many data points went into them). If we always add data chunks with equal weights, this can be achieved by using a binary tree, i.e. let M1 be the moment estimates from one chunk. Two of them are added to M2, Two M2 are added to M4, and so on. This way you need to store log2 (n_chunks) number of Moment estimates. Now in practice you might get data in chunks of unequal length or weight. Therefore we need some heuristic when two Moment estimates should get merged. This is the role of rtol.

I can't really say anything about the C-code. Maybe someone else should have a look at it.

No need for now, it's pretty simple. I can explain you what's going on there.

Is there a way to run the tests directly on github? Or do I have to download the new branch first?

Just for curiosity:

In variational.estimators.moments.momentsXX: why do we switch to float32 if the number of time steps is smaller than 2**23 and to float64 otherwise? Where does this number come from?

float32 has 23 bits for its base. Thus numbers up to 2^23 can be definitely represented with float32 without rounding error. When using boolean arrays (1/0), the largest possible value that can occur after a matrix product is T, i.e. the length of the scalar products. We use floats instead of integers because the BLAS matrix product implementation for floats is much faster than that for ints.

Why can we not use the fast subtract_row routine if the array is sparse but not a copy of the original array? What is the problem in this case?

In that case we use a sliced representation [:, cols]. This is just a view of the array, not a true copy. So whenever accessing the array we must know the indexes. Fine for Python, but if we work in C, we would need to pass the indexing information too, because we don't have the view available there. One could in principle do this with more complicated C functions, but it probably wouldn't be Cache-efficient anymore, so it's just not worth doing it.

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/variational/pull/13#issuecomment-157764753.

Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

markovmodel / variational

Covar #13

Mail: Arnimallee 6, 14195 Berlin, Germany

Mail: Arnimallee 6, 14195 Berlin, Germany

Mail: Arnimallee 6, 14195 Berlin, Germany