markovmodel / deeptime

Deep learning meets molecular dynamics.
GNU Lesser General Public License v3.0
175 stars 39 forks source link

Estimating Koopman Operator with data from multiple trajectories #23

Closed niklastoe closed 6 years ago

niklastoe commented 6 years ago

Hi,

in the Alanine dipeptide example with multiple files , pred_ord is calculated for total_data_source, i.e. for all trajectories. pred_ord is passed on to vamp.get_its and vamp.get_ck_test, both of which call vamp.estimate_koopman_op. pred_ord does not contain any information about the trajectories so vamp.estimate_koopman_op will consider pairs across trajectories when calculating c_tau e.g. traj1[-1] with traj2[tau-1]. These pairs are random since there's no connection between the trajectories.

In the example it's not a big issue since tau is much smaller than the length of each individual trajectory. Unless the random pairs correspond to transitions that are extremely rare, the Koopman operator won't differ much.

If that's the case though, what should one do? Estimate the Koopman operator for each individual trajectory and average all of them? I did that for the example, the transition probabilities involving the sparsely populated state 3 change a little bit.

image image

Out of curiosity, is there any explanation for the negative values in the matrix or is that a numerical issue?

In other scenarios where tau might be in the range of 10^1 ns and trajectory lengths are in the range of 10^2 ns there would be a considerable number of random pairs and a large deviation in the Koopman operator.

Please let me know what you think. I assume I'll be working with trajectories that are 5-10 tau long so I'd like a convenience function that takes care of this. If you agree with the overall thought and think that averaging over all the Koopman operators from different trajectories is fine, I'll go ahead and write it.

Niklas

amardt commented 6 years ago

Hi, you are right, this is an issue. And averaging over it is definitely not the right solution. Instead you should just not include the datapoints crossing the trajectories into your estimation. We will adapt our function for this case as soon as possible.

The negative value issue: The Koopman operator can have negative values. You cannot interpret it as a probability transition matrix as for a real Markov State Model. The row sum is still 1 though. But usually this is not really an issue. You can still interpret the output of your network as a fuzzy clustering and use the eigenvalues of the Koopman matrix for calculating the implied timescales. For non-reversible and non-stationary systems the last point breaks down though, because the eigenvalues will become complex. But the Koopman model still holds and is able to describe the dynamical system (via the singular values/functions).

If you want to have a valid transition matrix, have a look at our paper: Deep Generative Markov State Models This solves this issue and also shows you how you can construct a generative model which approximates the conditional probability distribution in configuration space. By this you can simulate a new trajectory with unseen configurations based on the conditional distribution your model learned.

Hope this helps! Otherwise just contact us!

Best Andreas

niklastoe commented 6 years ago

Thank you very much for the quick reply and the explanation! I'll read the paper.

Best, Niklas

amardt commented 6 years ago

Hey, we now added the multiple trajectories option for the Koopman matrix estimation. It needs them as a list though. See our Alanine-Dipeptide example with multiple files.

If you need clarification, just ask!

Best Andreas

niklastoe commented 6 years ago

Hi,

Great, thank you!

I think it should be

for l, length_i in enumerate(traj_lengths - tau):

instead of

for l, length_i in enumerate(traj_lengths):

to avoid mixing trajectories since pred_ord is of length sum(traj_lengths) - len(traj_lengths) * tau. In the example, traj_list[0].shape[0] == 250000 but traj_list[-1].shape[0] == 249976.

Best, Niklas

amardt commented 6 years ago

Hey, yes you are absolutely right! Thank you! I changed the code a bit different though: With your solution you would for each trajectory loose tau datapoints for your estimation (And in your example from above this could be many). In the example we reorganize the prediction of the network to have a state classification for each data point (3 * 250 000). Then you don't have to changed the code you were pointing at and you don't loose that many data points.

Best,

Andreas

niklastoe commented 6 years ago

Right, that's much better!