markovmodel / adaptivemd

A python framework to run adaptive Markov state model (MSM) simulation on HPC resources
GNU Lesser General Public License v2.1
18 stars 7 forks source link

Reduced Trajectories #23

Closed jhprinz closed 7 years ago

jhprinz commented 7 years ago

Allow trajectories to be saved by a frame-reduced full-atom trajectory + a atom-reduced full-frame trajectory.

This changes a Trajectory or its subclass to be a group of files. These should be handled like an ordinary file, but a name change will alter all files in the same way.

This needs some thought.

But finally you will have a ReducedTrajectory that is a group of 3 files.

  1. Restart
  2. frame-reduced traj
  3. atom-reduced traj

And a RestartTrajectory just

  1. Restart
  2. full trajectory

Or whatever combination you want. Copying, etc will work on the whole group and I think it makes sense, to use one filename + multiple extensions Instead of multiple unique filenames.

jhprinz commented 7 years ago

Well, I decided to implement this in the following way (~ 95% done). Please comment

So far, I have a strict single file approach. You have commands to express copy, move, link etc of a single file and a trajectory is considered a single file which you can copy, etc like any other file.

The restart file is treated also like a separate file and everywhere you are dealing with trajectories - in principle - you need to check if it is a trajectory with restart or not and copy the restart file as well. SInce we now want to use a their file for the reduced trajectory we always have to check if this is the case for a particular trajectory and act accordingly.

This behavior works but can lead to errors and requires a Task writer to know about the structure of a Trajectory and cover all the possible cases. Since we do not want the user to have a knowledge about the actual implementation of trajectories (unless he write a specific engine) we could propose multi-file File objects.

Multi-File

This is simply treated as a single file with a name, but can have additional files with it, that only differ by an extra extension. Like

00000000.dcd
00000000.dcd.restart
00000000.dcd.reduced

If you copy or move such a multi file, you will immediately copy and move all the files.

The MultiFile object has meta information and knows the basename and a list of extensions. THis also saves space in saving multiple related names. You can access the other locations using an attribute

file = Trajectory(...)
print file.restart  # `Location()`
print file.reduced  # `Location()`

It seems in general useful to treat some file groups as one, so I would add this to the File class.

How to treat trajectories

The main question to be answered it, whether we hard-code the reduced capabilities into the Trajectory object or a sub-class), or do we let the user deal with it.

1. Question

Assume you have some simple trajectories and some that use reduced (have a time-stride full trajectory and a full-time atom-reduced trajectory). Which one do you pass to the PyEMMAAnalysis?

Or do you pass the multi-file object and the pyemma task figures out what to do. At least, the generator knows all the trajectory files that are passed to it and it could pass the appropriate files like reduced topology, etc...

2. Question

When you call

traj = Trajectory()
frame = traj[20]

what is frame. Does it relate to the native time steps or does it relate to the steps in the time-stride full trajectory. And if the first, what do you get or do, when you use a frame from which you cannot start because this one is only in the atom-reduced trajectory.

3. Question

If we say that a ReducedTrajectory object exists, is the specific reduction something that is also fixed for a particular engine or is that an option of the task generator.

# engine attribute approach (recommended)
engine = OpenMMEngineReduced(
    ..., 
    selecion='{some selection str}', 
    time_restart_stride=200,  # every 200th saved frame also store full coordinates
    save_every_n_frames=1000  # save every 1000th simulated frame)

# task property approach
t = engine.run(
    ...,
    selecion='{some selection str}', 
    time_restart_stride=200,  # every 200th saved frame also store full coordinates
    save_every_n_frames=1000  # save every 1000th simulated frame)

The engine approach makes sense, but means to create a new engine if you want to change these settings. If so, then we should store the engine used with a trajectory.

There are more things to consider to really keep all of this consistent. I will continue tomorrow and think about it.

franknoe commented 7 years ago

Nuria and Tim can better comment on the practicalities than me, but I have a few general notes:

Does that help answer some of the questions?

Am 18/03/17 um 23:59 schrieb Jan-Hendrik Prinz:

Well, I decided to implement this in the following way (~ 95% done). Please comment

So far, I have a strict single file approach. You have commands to express copy, move, link etc of a single file and a trajectory is considered a single file which you can copy, etc like any other file.

The restart file is treated also like a separate file and everywhere you are dealing with trajectories - in principle - you need to check if it is a trajectory with restart or not and copy the restart file as well. SInce we now want to use a their file for the reduced trajectory we always have to check if this is the case for a particular trajectory and act accordingly.

This behavior works but can lead to errors and requires a |Task| writer to know about the structure of a |Trajectory| and cover all the possible cases. Since we do not want the user to have a knowledge about the actual implementation of trajectories (unless he write a specific engine) we could propose /multi-file/ |File| objects.

    Multi-File

This is simply treated as a single file with a name, but can have additional files with it, that only differ by an extra extension. Like

|00000000.dcd 00000000.dcd.restart 00000000.dcd.reduced |

If you copy or move such a multi file, you will immediately copy and move all the files.

The |MultiFile| object has meta information and knows the basename and a list of extensions. THis also saves space in saving multiple related names. You can access the other locations using an attribute

file = Trajectory(...) print file.restart# Location() print file.reduced# Location()

It seems in general useful to treat some file groups as one, so I would add this to the |File| class.

    How to treat trajectories

The main question to be answered it, whether we hard-code the reduced capabilities into the |Trajectory| object or a sub-class), or do we let the user deal with it.

      1. Question

Assume you have some simple trajectories and some that use /reduced/ (have a time-stride full trajectory and a full-time atom-reduced trajectory). Which one do you pass to the |PyEMMAAnalysis|?

Or do you pass the multi-file object and the pyemma task figures out what to do. At least, the generator knows all the trajectory files that are passed to it and it could pass the appropriate files like reduced topology, etc...

      2. Question

When you call

traj= Trajectory() frame= traj[20]

what is frame. Does it relate to the native time steps or does it relate to the steps in the time-stride full trajectory. And if the first, what do you get or do, when you use a frame from which you cannot start because this one is only in the atom-reduced trajectory.

      3. Question

If we say that a |ReducedTrajectory| object exists, is the specific reduction something that is also fixed for a particular engine or is that an option of the task generator.

engine attribute approach (recommended)

engine= OpenMMEngineReduced( ..., selecion='{some selection str}', time_restart_stride=200,# every 200th saved frame also store full coordinates save_every_n_frames=1000 # save every 1000th simulated frame)

task property approach

t = engine.run( ..., selecion='{some selection str}', time_restart_stride=200,# every 200th saved frame also store full coordinates save_every_n_frames=1000 # save every 1000th simulated frame)

The engine approach makes sense, but means to create a new engine if you want to change these settings. If so, then we should store the engine used with a trajectory.

There are more things to consider to really keep all of this consistent. I will continue tomorrow and think about it.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/markovmodel/adaptivemd/pull/23#issuecomment-287581166, or mute the thread https://github.com/notifications/unsubscribe-auth/AGMeQgsfGYB8Bp0FfSqaCQNfBvPsVCgnks5rnGHhgaJpZM4Mfk3N.

--


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de Mail: Arnimallee 6, 14195 Berlin, Germany

jhprinz commented 7 years ago

The example below suggests that you only consider variations of dcds.

Well, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class.

But, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder.

A (probably too complex) example

00000000/
    traj.dcd  # or blablu.xtc, ..., e.g.full, but stride 100
    restart.npz
    reduced1.xtc  # e.g. without water, but stride 10
    reduced2.xtc  # e.g. only active site, but stride 1

So the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure.

This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible.

in the Task code this would then look like

t = Task(...)
my_traj = t.get(traj_obj, 'traj')  # this link the whole folder to the working directory as `traj/`

# then load a file by concatenation
np_file = np.load(os.join(my_traj, 'restart.npz'))

# perhaps some helper method
np_file = np.load(my_traj.path_to('restart.npz'))

That's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist.

Another thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside.

franknoe commented 7 years ago

That looks good to me. What about storing metadata such as timepoints? 

Sent from my T-Mobile 4G LTE Device

-------- Original message -------- From: Jan-Hendrik Prinz notifications@github.com Date: 3/20/17 3:22 AM (GMT-06:00) To: markovmodel/adaptivemd adaptivemd@noreply.github.com Cc: Frank Noe frank.noe@fu-berlin.de, Comment comment@noreply.github.com Subject: Re: [markovmodel/adaptivemd] Reduced Trajectories (#23)

The example below suggests that you only consider variations of dcds.

Well, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class. But, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder. A (probably too complex) example 00000000/ traj.dcd # or blablu.xtc, ..., e.g.full, but stride 100 restart.npz reduced1.xtc # e.g. without water, but stride 10 reduced2.xtc # e.g. only active site, but stride 1

So the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure. This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible. in the Task code this would then look like t = Task(...) my_traj = t.get(traj_obj, 'traj') # this link the whole folder to the working directory as traj/

then load a file by concatenation

np_file = np.load(os.join(my_traj, 'restart.npz'))

perhaps some helper method

np_file = np.load(my_traj.path_to('restart.npz'))

That's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist. Another thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/markovmodel/adaptivemd","title":"markovmodel/adaptivemd","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/markovmodel/adaptivemd"}},"updates":{"snippets":[{"icon":"PERSON","message":"@jhprinz in #23: \u003e The example below suggests that you only consider variations of dcds.\r\n\r\nWell, no, that was not the point, the bundle idea was like dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class. \r\n\r\nBut, I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general, but is specific to the engine. You also do not register each file separately in the DB but only the folder.\r\n\r\nA (probably too complex) example\r\n\r\n\r\n00000000/\r\n traj.dcd # or blablu.xtc, ..., e.g.full, but stride 100\r\n restart.npz\r\n reduced1.xtc # e.g. without water, but stride 10\r\n reduced2.xtc # e.g. only active site, but stride 1\r\n\r\n\r\nSo the engine you use writes these files and if you want to restart from a frame or continue a trajectory, then the engine need to make sense out of your file structure.\r\n\r\nThis approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible. \r\n\r\nin the Task code this would then look like\r\n\r\nt = Task(...)\r\nmy_traj = t.get(traj_obj, 'traj') # this link the whole folder to the working directory as `traj/`\r\n\r\n# then load a file by concatenation\r\nnp_file = np.load(os.join(my_traj, 'restart.npz'))\r\n\r\n# perhaps some helper method\r\nnp_file = np.load(my_traj.path_to('restart.npz'))\r\n\r\n\r\nThat's it. Nothing else needs to change. It also means, that you cannot directly get specific files, because the framework does not explicitely know, which files will exist.\r\n\r\nAnother thing to consider is, that all real trajectory files will have the same name now! In different folders, but still. That could be a downside."}],"action":{"name":"View Pull Request","url":"https://github.com/markovmodel/adaptivemd/pull/23#issuecomment-287698286"}}}

jhprinz commented 7 years ago

You mean inside the directory? That would be possible but doubling data in the DB. There is an entry in the DB, which depends on the class you use and you can always subclass a Trajectory and add more metadata, e.g. The current Trajectory supports restart already. I would leave that, but turn the restart into a bool. Marking if there is a restart file (but noone know the name other then the engine that wrote it).

If you want the Reduced Trajectory we could add these to the main class or you subclass into ReducedTrajectory and it has metadata like used stride the atom-selection string, etc. That is arbitrary and you can use all of that in your adaptive code. But it becomes less reusable because it gets engine specific eventually. So a gromacs user might need another subclass etc...

I would like to keep it as easy as possible. Maybe the following metadata by default. If you need more, say (multiple reduced trajectories, multiple strides, ...) you need to subclass and do that yourself.

class Trajectory(...):
    """
    location : `Location`
        the physical location of the file an the HPC
    initial : `Frame` or `File` or None
        the initial frame if known. Can be the frame of another file
    length : int
        the length in frames of the trajectory
    # either engine and/or task. I think task should be fine
    engine : `Engine` : None
        a reference to the engine used. The engine knows things like strides/ 
        save intervals, selections, topology, etc...
    task : `Task` or None
        the task used to generate the trajectory. Since a task knows the engine 
        you know which engine was used, too
franknoe commented 7 years ago

I think adding some basic information which times are associated with the frames is universal and can be done with using any engine as a backend (that doesn't necessarily mean that information is in the trajectory file itself, which is why I think the sampling framework would be the natural place to take care of that).

Maybe it's sufficient to set a t0 and dt parameter (starting time and timestep)? Assuming that the saving time is regular, it would then be possible to relate different versions of this trajectory with different saving intervals, or to concatenate continued trajectories (which may or may not have overlaps that need to be removed) etc.

Am 20/03/17 um 10:09 schrieb Jan-Hendrik Prinz:

You mean inside the directory? That would be possible but doubling data in the DB. There is an entry in the DB, which depends on the class you use and you can always subclass a |Trajectory| and add more metadata, e.g. The current |Trajectory| supports restart already. I would leave that, but turn the |restart| into a bool. Marking if there is a restart file (but noone know the name other then the engine that wrote it).

If you want the Reduced Trajectory we could add these to the main class or you subclass into |ReducedTrajectory| and it has metadata like used stride the atom-selection string, etc. That is arbitrary and you can use all of that in your adaptive code. But it becomes less reusable because it gets engine specific eventually. So a gromacs user might need another subclass etc...

I would like to keep it as easy as possible. Maybe the following metadata by default. If you need more, say (multiple reduced trajectories, multiple strides, ...) you need to subclass and do that yourself.

class Trajectory(...): """ location : Location the physical location of the file an the HPC initial : Frame or File or None the initial frame if known. Can be the frame of another file length : int the length in frames of the trajectory

either engine and/or task. I think task should be fine

engine : Engine : None a reference to the engine used. The engine knows things like strides/ save intervals, selections, topology, etc... task : Task or None the task used to generate the trajectory. Since a task knows the engine you know which engine was used, too

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/markovmodel/adaptivemd/pull/23#issuecomment-287706404, or mute the thread https://github.com/notifications/unsubscribe-auth/AGMeQr8l-TRsUt85pXxYQY9mIah1EuaAks5rnkJkgaJpZM4Mfk3N.

--


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de Mail: Arnimallee 6, 14195 Berlin, Germany

thempel commented 7 years ago

engine = OpenMMEngine(..., output=output_dict)


One could reference the trajectories by their `name` and name the files like that, making it very flexible and also self-explanatory to the user. The trajectory that has to be used for frame-picking should have something like a `'master': True` to be found. 
We spoke about computing features for already finished trajectories, which in this formulation would mean something like `engine.add_output_type({some dictionary...})`. Probably it would make sense to embed this better in the `project`-object... Is that too general? 
jhprinz commented 7 years ago

Question 3: From my perspective, it makes sense to couple the different time-strides to the engine in a immutable way. If the user changes these settings, everything will break down. On the other hand however, it would be good to be able to add feature trajectories for the case of an adaptive strategy change. I was thinking of some kind of dictionary, maybe something like

I like that idea. What about this:

engine = OpenMMEngine(..., )
engine.dt  # get the integrator stepping
>>> 2 fs
engine.add_output_type(
    name='master',
    file='output.dcd'  # default would be name + '.dcd'
    save_every=1000  # or use str `10ps`
)
engine.add_output_type(
    name='protein',
    save_every=100,  # or use str `1ps`
    selection='protein'
)
engine.add_output_type(
    name='intramoldist',
    save_every=100,  # or use `1ps
    pyemma='feature description'  # I guess that is overkill for now
)

Then the engine would just now what to do and what the structure of the folder is

traj/
  output.dcd  # should have a simple default setting option
  protein.dcd
  intramoldist.npy
  restart.npz  # always there

If you have a trajectory generated by that engine you can (already) do:

t = Trajectory(...)
t.file('output.dcd')  # get a location of the file

# simpler would be
t.output('master')

# and to get all files of a certain type run
project.trajectories.all.output('protein')  # this calls the output() on all trajectories
thempel commented 7 years ago

Looks good!

Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract save_every etc. for all or a given output type. Maybe something like the mentioned dictionary as a property, like engine.output_types?

jhprinz commented 7 years ago

We spoke about computing features for already finished trajectories, which in this formulation would mean something like engine.add_output_type({some dictionary...}). Probably it would make sense to embed this better in the project-object... Is that too general?

A few things to consider when you change (or only add) output types along the way

  1. To not violate immutability you need to create a new engine, but that would be possible and simple. Still you now have 2 engines and need to get the right one.

  2. What happens, when you want to extend a trajectory? All of the previous trajectories do not have the new output computed. So you would have either run all trajectories and compute the new output or forbid to do that with existing trajectories.

  3. As you said, you could have a project function that can enhance an existing engine and make sure that all is prepared after adding a new output.

All in all not trivial...

jhprinz commented 7 years ago

Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract save_every etc. for all or a given output type. Maybe something like the mentioned dictionary as a property, like engine.output_types?

Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot.

From the engine, maybe something like

engine = OpenMM(...)
engine.add_output_type(...)
# as short for
engine.outputs[name] = OutputTypeDecriptor(...)
and access it in the same way
# option 1
engine.outputs['protein'].save_every

# option 2 - access via attribute if the name is python compatible
engine.protein.save_every

# option 3 - rename the file access with __getitem__
engine['protein'].save_every
# and access the files (which is really no problem)
engine.files['my_file']
# this option could also be a shortcut for option 1

using the descriptor you can use it also as an input to other generators (pyemma) without storing it again.

modeller = PyEMMA(traj_descriptor=engine['protein'])

I think that looks pretty nice

jhprinz commented 7 years ago

The Descriptors can even be independent of the Engine and be used (to some extend) for other engines, too. At least options like the filename or saving interval work everywhere. Selection could be difficult.

thempel commented 7 years ago

New Trajectory Output Types In practice, when deciding on a new feature for the analysis, the MSM needs to be build for all trajectories. Not using the trajectories that have been computed before would generally not be a good idea. But I see the problem that it is not necessarily known outside the actual MD script, how these features are actually computed. Hmm. The simplest (and probably most messy...) approach that comes to my mind would be something like this (related to your bullet point 3) with a string to be executed.

trajs = [trj['protein'] for trj in project.trajectories if trj.engine.name == 'old engine']
command = 'import mdtraj\n mdtraj.load(INPUT).save_xtc(OUTPUT)'  # INPUT and OUTPUT being called internally to be compatible to the DB
task = complete_existing_trajectories(trajs, command)
project.queue(task)

I'm not sure it this is a good idea, though. Since the command cannot be extracted from the engine, it is impossible to check whether anything is congruent at all. Maybe it's better if the user does this manually and there is only a function to add the files, that were generated, to the DB.

Output Type Descriptors

Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot.

I like that approach, looks nice and fits with the rest.

jhprinz commented 7 years ago

New Trajectory Output Types

Correct, this will usually be part of a strategy. Just running Pyemma with new features is trivial, of course. Create a new PyEmma generator and use this one instead. If you want to add a new reduced trajectory or a feature trajectory that is automatically used is another matter.

Not using the trajectories that have been computed before would generally not be a good idea.

of course, we always use what we have.

So, when you want to add new output types whcih produce .dcd (or whatever you are using) or feature trajcectories with numpy/pyemma you will probably have to proceed like this:

  1. Create a new enhanced engine (NEW), that has the old and new outputs and keeps the rest.

  2. So far all existing trajctories are compatible with the old engine (OLD), so if you have a traj of length 100, then all the outputs ran with the old engine conform to that length (depending on individual strides, of course)

  3. Now you can run an traj update, that will take an existing trajectory and update it to the new engine like

new_engine.update(trajectory)

This will create a new trajectory with the same name, but the updated engine and will update all outputs to match the new engine. Limitation is that you can only add outputs, not remove some. That would lead to inconsistencies.

  1. if you want, run the update on all trajectories or whichever you want. This will update your project

  2. If you want to extend a trajectory then you have to first update.

Looks not too complicated and concise in that a trajectory and its engine fit together.

thempel commented 7 years ago

This looks much better than what I posted above, but I don't understand how this can internally be handled. How does engine know how to generate the output, e.g. if this is normally done by an openMM reporter?

jhprinz commented 7 years ago

How does engine know how to generate the output, e.g. if this is normally done by an openMM reporter?

Well, the engine needs to pass (somehow) the output types to openmmrun.py This is the most annoying part. Not difficult but requires some work.

As an example you need to call openmmrun.py with

--outputtypes="{'master: {'file': 'output.dcd', 'stride': 10, 'selection': 'protein'},  ... }"

I think just JSON would be good. Maybe replace " with ' to make it look like python and still have " for the bash. This is the easiest.