msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0
25 stars 28 forks source link

Feature Request: Hooks to modify states #68

Closed synapticarbors closed 11 years ago

synapticarbors commented 12 years ago

This is a feature request to add hooks in MSMBuilder so that the state space can be manipulated without modifying the existing code.

When version 2.0 was released on the SimTK website, I hacked in a number of project-specific features including using environmental variables to select the distance metric and some methods to modify the state space to remove states beyond what was done with the ergodic trimming. This involved monkey patching MSMLib and rewriting a number of the scripts to call the custom version of the code. I see that in version 2.5 it is much easier to use custom distance metrics. I think it would be quite useful to extend the plugin mechanism and add hooks throughout the code to allow users to modify the count matrix and state space in a similar way.

rmcgibbo commented 12 years ago

Thanks for filing this! I think this is a nice idea, but I'm still a little hazy on what it entails. Can you be a little more specific on how you'd like to modify the count matrix / state space?

If we do a plugin architecture here, then basically we need to decide what the API for the plugins are and where they're called? Are you thinking of something that gets called like right after ergodic trimming? Should it take in just the count matrix, or does it need geometry information as well?

synapticarbors commented 12 years ago

Hi Robert - Basically there are regions of conformational space that I know are not sampled well and are visited infrequently, but are not trimmed by the ergodic trimming mechanism because they are fully connected. Specifically, I'm looking at ligand binding so outside some radius of the binding site or volume, I just want to trim those states. So I need access to the geometry information of where all of the generators are, the count matrix, and then the ability to modify the mapping. This would be called after the ergodic trimming. There would also need to be a way to get a couple of parameters into the method that are specified on the command line to define the volume.

The way I accomplished this in the earlier version of MSMbuilder was to write a custom copy of a script (e.g. BuildMSM_x.py) that was basically a copy of the original, but then had something like the following at the top of the file:

from msmbuilder import Serializer,MSMLib
from msmbuilder import CustomMSMLib,Project,Trajectory

# Monkey patch MSMLib method to use custom calculation of count matrix
MSMLib.GetCountMatrixFromAssignments = CustomMSMLib.CustomGetCountMatrixFromAssignments 

where CustomGetCountMatrixFromAssignments looks something like:

def CustomGetCountMatrixFromAssignments(Assignments,NumStates=None,LagTime=1,Slide=True):
    """Calculate count matrix from Assignments."""

    ....

    if not NumStates:
        NumStates = 1 + int(np.max([np.max(a) for a in Assignments]))   # Lutz: a single np.max is not enough, b/c it can't handle a list of 1-d arrays of different lengths
        assert NumStates >= 1

    C=scipy.sparse.lil_matrix((int(NumStates),int(NumStates)),dtype='float32')  # Lutz: why are we using float for count matrices?

    for A in Assignments:
        FirstEntry=np.where(A!=-1)[0]
        if len(FirstEntry)>=1:#New Code by KAB to skip pre-padded negative ones.  This should solve issues with Tarjan trimming results.
            FirstEntry=FirstEntry[0]
            A=A[FirstEntry:]
            C=C+MSMLib.GetTransitionCountMatrixSparse(A,NumStates,LagTime=LagTime,slidingwindow=Slide)#.tolil()

    # Trim states outside of cylindrical volume
    try:
        C,Mapping = trim_states(C) 
        MSMLib.ApplyMappingToAssignments(Assignments,Mapping)

    except NotImplementedError:
        print('WARNING: trim_states not implemented')
    except:
        raise

    return(C)

and trim_states is defined elsewhere in CustomMSMLib. This is not ideal because for any script that requires this, I need to make a copy of the script and monkey patch it.

rmcgibbo commented 12 years ago

I'm going to think about this for a bit. I'm not quite sure what the "right" way -- software engineering wise -- to do this is.

tjlane commented 12 years ago

hi synapticarbors,

We were just chatting about how to accommodate this. Would it satisfy your needs if we implemented a generalized trimming function that takes a list of states to trim and returns a trimmed counts/assignments? My vision is that you'd get a list (states_to_trim) and then be able to do something like

trimmed_counts, trimmed_assignments = trim_states( states_to_trim, counts, assignments )

maybe with assignments as an optional arg.

Then we could change ergodic trimming to return an appropriate list of states_to_trim -- specifically the indices of all the states that should be discarded to ensure that we keep only the strongly connected subgraph of the original data.

You could then apply your method by first evaluating your metric of choice on all your msm states, then calling states_to_trim = np.where( metric_values > cutoff ), and passing that list to the trim_states function above. If someone wanted to trim many states, they could just do states_to_trim = list1 + list2 + ... and then pass the concatenated list.

Coding this API is pretty trivial. Making a general script is also easy, but we should think a little about how to make it as useful as possible.

Is something like this what you had in mind?

synapticarbors commented 12 years ago

Hi TJ - that sounds reasonable. The main thing is I want to be able to apply the trimming without having to modify or monkey patch any of the MSMBuilder internals. Having an additional step in the workflow where I have to write a custom script to determine which states need to be trimmed is fine though as long as the results can be easily applied back before any Markov State analysis is run.

Josh

rmcgibbo commented 12 years ago

I'm not sure if we're going to be able to do this without modifying the msmbuilder internals.

We do have the pkg_resources plugin discovery framework, which is being used for GPU_RMSD and LPRMSD. I think that be more trouble than it's worth though.

tjlane commented 12 years ago

Hey Josh,

I just implemented exactly what I described above into MSMLib. Check out the new code and let me know if it (1) works to fulfill your needs and (2) is bug-free :). It passed the few tests I threw at it.

Feel free to re-open the issue if you notice something.

TJ

synapticarbors commented 12 years ago

Hi TJ,

I haven't been able to sit down with this for a while, but I'm hoping to in the next week. I wanted to run a modification of the code by you that, which I'm planning on implementing and then submitting a pull request. While the code you implemented is a start, the problem is that if I want to run a script like BuildMSM.py, I would still need to make a custom version of it that calls trim_states. The route that I'm thinking about taking is:

  1. For BuildMSM.py and CalculateImpliedTimescales.py, add an additional command line argument that passes in a file of auxiliary states to trim that is just a text file of state indices (sort of like an atomindices file).
  2. Anywhere that ergodic_trim is called in the code, instead call ergodic_trim_indices, collect the state indices and then calculate the union of that list and the list of misc user-defined states passed in from the command line.
  3. Pass the complete list of states to trim to trim_states.

If there are no user-defined auxiliary states to trim then the code will behave as before, but it adds some flexibility for users to trim additional states if necessary.

Let me know if this plan sounds ok and would be something that you guys would merge into the code base if I submit it.

Josh

tjlane commented 12 years ago

I think that sounds perfectly reasonable. So long as the default behavior is just like before, I think we're happy with exposing additional functionality -- go for it!

FYI I recently added a flag to those scripts that allows the user to skip the ergodic trimming step, per the request of another user. If you just re-appropriate the argparse lines for that to do what you describe above (default: ergodic trim, other options: no trimming, or pass a list of states to trim), we won't even increase the complexity of the user experience.

Looking forward to seeing your PR!

synapticarbors commented 12 years ago

TJ - a couple of quick questions because a couple of things don't make sense in the code, and I'm wondering if they are typos or untested bugs (or me just not getting what is being done):

  1. In the trim_states method you wrote, there is a line
apply_mapping_to_assignments(trimmed_assignments, Mapping) # nefarious in-place

Previous to that line there is a variable called mapping, but Mapping does not appear. This appears to be a typo and is picked up as an error by pyflakes. It looks like a copy-and-paste error with code entering from a different method using a different capitalization scheme.

  1. The logic in ergodic_trim_indices and ergodic_trim seems flipped. They both use the same basic procedure to get the indices of a set of states, but in the former these are marked as states to trim and in the latter, they are named GoodComponent and seem to be the states that are retained. trim_states seems to follow the logic of ergodic_trim_indices, which looks incorrect to me. It looks to me like you are actually retaining the states that are being passed in as the states to trim and throwing away the others. The way the test is written, because of the flipped logic in trim_states and ergodic_trim_indices, it still passes.
tjlane commented 12 years ago

Good spot.

I think you are right I had things backwards -- if you look at the function ergodic_trim() my comment up here makes it clear that I had them mixed up in my mind. Also you're right that I did copy/paste the code, and didn't write the original code, which is super hard to understand.

Regardless this is a bug (and a bad one) that I'll fix momentarily. Thanks for the careful eye!

tjlane commented 12 years ago

Josh,

Wanna take a look at the above commit and tell me if everything looks good to you?

TJ

synapticarbors commented 12 years ago

TJ - I haven't tested the code, but a few general comments/observations:

Mapping = np.zeros(counts.shape[0], dtype='int') - 1
    for i, x in enumerate(GoodComponent):
        Mapping[x] = i
tjlane commented 12 years ago

I think items 1,3 above aren't a big deal. I had the opposite intuition for 3, so meh.

Item 2 is a problem, and one that should be addressed. Didn't fix it last night b/c I just wanted to clean up the bug you pointed out w/o changing any call signatures.

Want to fix that in your PR? Is trivial change.

synapticarbors commented 12 years ago

Sure, I'll take care of it in my PR. I just wanted to check before I changed an interface since I'm not a core developer. Also just to double check, there is no reason to return the assignments array since that is modified in-place, correct? It looks like it is sufficient just to return the trimmed count matrix and the state mapping as ergodic_trim was doing originally.

tjlane commented 11 years ago

@synapticarbors we're in the process of rounding up old issues before a release -- did this ever get sufficiently resolved for you? I don't recall you ever submitting a PR but I could be wrong ( @rmcgibbo ?)

synapticarbors commented 11 years ago

Hi TJ - As with a lot of things, I got side tracked on this one, so there was never a pull request. Since this isn't urgent you can probably go ahead and close this. I was looking to have (or create) a more general plugin mechanism, but for expediency, I can probably make due with what is currently in the code.

rmcgibbo commented 11 years ago

I'll close this for now, but we're still very interested in making these types of hooks/plugins as easy as possible. @synapticarbors, if you have any other feature requests (or just gripes about usability, error reporting, etc) on msmbuilder, now is a great time to chime in. We're currently (starting in the last 2 weeks or so) planning the development roadmap towards msmbuilder 3.

Some of the goals that we have are compiled on the github wiki, here