Closed jhprinz closed 7 years ago
Well, I decided to implement this in the following way (~ 95% done). Please comment
So far, I have a strict single-file approach. You have commands to express copy, move, link, etc. of a single file, and a trajectory is considered a single file which you can copy, move, etc. like any other file.
The restart file is also treated as a separate file, and everywhere you are dealing with trajectories you in principle need to check whether it is a trajectory with restart or not and copy the restart file as well. Since we now want to use a third file for the reduced trajectory, we always have to check if this is the case for a particular trajectory and act accordingly.
This behavior works but can lead to errors and requires a Task writer to know about the structure of a Trajectory and to cover all the possible cases. Since we do not want the user to need knowledge about the actual implementation of trajectories (unless they write a specific engine), we could propose multi-file File objects.
Multi-File
This is simply treated as a single file with a name, but it can have additional files with it that only differ by an extra extension, like:

```
00000000.dcd
00000000.dcd.restart
00000000.dcd.reduced
```
If you copy or move such a multi-file, all associated files are immediately copied or moved along with it.
The MultiFile object has meta information and knows the basename and a list of extensions. This also saves space when storing multiple related names. You can access the other locations using an attribute:

```python
file = Trajectory(...)
print file.restart  # `Location()`
print file.reduced  # `Location()`
```
It seems generally useful to treat some file groups as one, so I would add this to the File class.
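The grouping behavior described above could be sketched roughly like this (the `MultiFile` name is taken from the discussion, but the methods and attributes are illustrative assumptions, not the actual implementation):

```python
import shutil


class MultiFile(object):
    """A file with companion files that differ only by an extra extension."""

    def __init__(self, location, extensions=()):
        self.location = location              # e.g. '00000000.dcd'
        self.extensions = list(extensions)    # e.g. ['restart', 'reduced']

    @property
    def locations(self):
        # the main file plus all companion files
        return [self.location] + [
            self.location + '.' + ext for ext in self.extensions]

    def copy(self, target):
        # copying the multi-file copies every member in one go
        for src in self.locations:
            suffix = src[len(self.location):]
            shutil.copy(src, target + suffix)


f = MultiFile('00000000.dcd', ['restart', 'reduced'])
print(f.locations)
# -> ['00000000.dcd', '00000000.dcd.restart', '00000000.dcd.reduced']
```

A Task writer would then call `copy` once and never have to know whether a restart or reduced file exists.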
How to treat trajectories
The main question to be answered is whether we hard-code the reduced capabilities into the Trajectory object (or a sub-class), or whether we let the user deal with it.
1. Question
Assume you have some simple trajectories and some that use reduced (have a time-strided full trajectory and a full-time atom-reduced trajectory). Which one do you pass to the PyEMMAAnalysis? Or do you pass the multi-file object and the pyemma task figures out what to do? At least the generator knows all the trajectory files that are passed to it, and it could pass the appropriate files like the reduced topology, etc.
2. Question
When you call

```python
traj = Trajectory()
frame = traj[20]
```

what is frame? Does it relate to the native time steps, or to the steps in the time-strided full trajectory? And if the former, what do you get, or what do you do, when you use a frame from which you cannot start because it exists only in the atom-reduced trajectory?
3. Question
If we say that a ReducedTrajectory object exists, is the specific reduction something that is fixed for a particular engine, or is it an option of the task generator?
```python
# engine attribute approach (recommended)
engine = OpenMMEngineReduced(
    ...,
    selection='{some selection str}',
    time_restart_stride=200,   # every 200th saved frame also stores full coordinates
    save_every_n_frames=1000   # save every 1000th simulated frame
)

# task property approach
t = engine.run(
    ...,
    selection='{some selection str}',
    time_restart_stride=200,   # every 200th saved frame also stores full coordinates
    save_every_n_frames=1000   # save every 1000th simulated frame
)
```
The engine approach makes sense, but it means creating a new engine whenever you want to change these settings. If so, then we should store the engine used with a trajectory.
There are more things to consider to really keep all of this consistent. I will continue tomorrow and think about it.
Nuria and Tim can better comment on the practicalities than me, but I have a few general notes:
It's useful to consider "bundles" of files generated by a simulation task that belong together. But these could be files of different types. The example below suggests that you only consider variations of dcds. How about organizing the outputs of a simulation in a results folder that can contain multiple files with arbitrary types / file extensions?
The analysis task should at least be able to get a full list of all files from a run, because we don't know yet what type of analysis we will want to do in the future. The most typical analysis is to process the full-frame trajectories without water, but we definitely shouldn't limit ourselves to only being able to do that.
I think indexes used literally as below (traj[20]) should always just refer to the 20th element of that list, so what that means in terms of time depends on traj. However, it would be very useful to have the times available as metadata. For example, if you have generated a simulation file with 1000 frames saved at 100 ps intervals, and a strided version with 10 frames saved at 10 ns intervals, it would be great to have these data available:

```python
traj.dt         # 100
traj.time_unit  # 'ps'
traj.times      # [0, 1, ..., 999]
# this also allows encoding a continued trajectory,
# e.g. we could start with a different time than 0

traj_strided.dt         # 10000
traj_strided.time_unit  # 'ps'
traj_strided.times      # [0, 100, ..., 900]
# with this information we can map between trajectories
```
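With such time metadata, mapping a frame index in one trajectory to the corresponding frame in another becomes a simple time lookup. A sketch (the `map_frame` helper and the example times lists are hypothetical):

```python
def map_frame(source_times, target_times, index):
    """Map a frame index from one trajectory into another trajectory,
    given each trajectory's list of frame times (same time unit)."""
    t = source_times[index]
    # only exact time matches are valid; otherwise the frame
    # does not exist in the target trajectory
    return target_times.index(t) if t in target_times else None


full_times = [100 * i for i in range(1000)]     # 1000 frames, 100 ps apart
strided_times = [10000 * i for i in range(10)]  # 10 frames, 10 ns apart

# frame 3 of the strided trajectory is frame 300 of the full one
print(map_frame(strided_times, full_times, 3))  # -> 300

# most full-trajectory frames have no strided counterpart
print(map_frame(full_times, strided_times, 5))  # -> None
```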
Does that help answer some of the questions?
--
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
> The example below suggests that you only consider variations of dcds.
Well, no, that was not the point; the bundle idea was like a dictionary but with a fixed set of files and types. What these files are and what they represent depends on the specific class.
But I like the idea of just referencing a directory. That is much simpler and practically already implemented. So we have a folder, and adaptivemd will only move the complete folder around. The specific content is not known to the framework in general but is specific to the engine. You also do not register each file separately in the DB but only the folder.
A (probably too complex) example:

```
00000000/
    traj.dcd      # or blablu.xtc, ...; e.g. full, but stride 100
    restart.npz
    reduced1.xtc  # e.g. without water, but stride 10
    reduced2.xtc  # e.g. only active site, but stride 1
```
So the engine you use writes these files, and if you want to restart from a frame or continue a trajectory, then the engine needs to make sense of your file structure.
This approach is somewhat less organized than the previous attempt and requires the user to be more careful, but it is MUCH more flexible.
In the Task code this would then look like:

```python
t = Task(...)
my_traj = t.get(traj_obj, 'traj')  # this links the whole folder into the working directory as `traj/`

# then load a file by path concatenation
np_file = np.load(os.path.join(my_traj, 'restart.npz'))

# perhaps some helper method
np_file = np.load(my_traj.path_to('restart.npz'))
```
That's it. Nothing else needs to change. It also means that you cannot directly get specific files, because the framework does not explicitly know which files will exist.
Another thing to consider is that all real trajectory files will now have the same name! In different folders, but still. That could be a downside.
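A minimal sketch of the folder-as-trajectory idea (the `TrajectoryFolder` class name is an assumption for illustration; `path_to` follows the helper suggested above):

```python
import os


class TrajectoryFolder(object):
    """A trajectory referenced only by its directory; the framework moves
    the folder as a whole and does not know the files inside it."""

    def __init__(self, dirname):
        self.dirname = dirname  # e.g. '00000000'

    def path_to(self, filename):
        # the engine-specific file layout is resolved only here,
        # never by the framework itself
        return os.path.join(self.dirname, filename)


traj = TrajectoryFolder('00000000')
print(traj.path_to('restart.npz'))  # '00000000/restart.npz' on POSIX
```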
That looks good to me. What about storing metadata such as timepoints?
You mean inside the directory? That would be possible, but it would double data in the DB.
There is an entry in the DB which depends on the class you use, and you can always subclass a Trajectory and add more metadata. The current Trajectory already supports restart. I would keep that, but turn restart into a bool marking whether there is a restart file (though no one knows its name other than the engine that wrote it).
If you want the reduced trajectory, we could add these to the main class, or you subclass into ReducedTrajectory with metadata like the used stride, the atom-selection string, etc. That is arbitrary, and you can use all of it in your adaptive code. But it becomes less reusable because it eventually gets engine specific. So a gromacs user might need another subclass, etc.
I would like to keep it as easy as possible. Maybe the following metadata by default. If you need more (say, multiple reduced trajectories, multiple strides, ...), you need to subclass and do that yourself.
```python
class Trajectory(...):
    """
    location : `Location`
        the physical location of the file on the HPC
    initial : `Frame` or `File` or None
        the initial frame if known. Can be the frame of another file
    length : int
        the length in frames of the trajectory

    # either engine and/or task. I think task should be fine
    engine : `Engine` or None
        a reference to the engine used. The engine knows things like strides,
        save intervals, selections, topology, etc...
    task : `Task` or None
        the task used to generate the trajectory. Since a task knows the engine,
        you know which engine was used, too
    """
```
I think adding some basic information about which times are associated with the frames is universal and can be done using any engine as a backend (that doesn't necessarily mean the information is in the trajectory file itself, which is why I think the sampling framework is the natural place to take care of it).
Maybe it's sufficient to set a t0 and dt parameter (starting time and timestep)? Assuming that the saving interval is regular, it would then be possible to relate different versions of this trajectory with different saving intervals, or to concatenate continued trajectories (which may or may not have overlaps that need to be removed), etc.
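Assuming a regular saving interval, `t0` and `dt` are indeed enough to reconstruct all frame times and to concatenate continued trajectories with overlap removal. A sketch (the helper name and values are hypothetical):

```python
def frame_times(t0, dt, n_frames):
    # absolute time of every saved frame, assuming a regular interval
    return [t0 + dt * i for i in range(n_frames)]


# a trajectory of 5 frames saved every 100 ps, starting at t=0
a = frame_times(0, 100, 5)    # [0, 100, 200, 300, 400]

# its continuation starts where the first left off
b = frame_times(500, 100, 5)  # [500, 600, 700, 800, 900]

# concatenation with overlap removal: keep only frames later than a's last
joined = a + [t for t in b if t > a[-1]]
print(joined)  # -> [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
```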
Taking folders sounds perfect to me. It also reduces the trajectory folder size for huge projects with 1,000s of trajectories. In the previous framework, the trajectory files were organized like that (with the same names) and as I recall, we did not have any issues with that.
Concerning Question 1, as Frank already said, normally the analysis is done on the protein-only trajectory (atom-reduced in your terms) while the frames are drawn from the all-atom (time-reduced) trajectory.
Question 3: From my perspective, it makes sense to couple the different time-strides to the engine in an immutable way. If the user changes these settings, everything will break down. On the other hand, it would be good to be able to add feature trajectories in case of an adaptive strategy change. I was thinking of some kind of dictionary, maybe something like:
```python
output_dict = [{'name' : 'all_atom_traj',
                'selection' : 'all',
                'save_every_frames' : 1000,
                'dt' : 100,
                'time_unit' : 'ps',
                'master' : True, ...},
               {'name' : 'protein_traj',
                'selection' : 'protein', ...},
               {'name' : 'intramolecular_distances', ...}, ...]

engine = OpenMMEngine(..., output=output_dict)
```
One could reference the trajectories by their `name` and name the files accordingly, making it very flexible and also self-explanatory to the user. The trajectory to be used for frame-picking should carry something like `'master': True` so it can be found.
We spoke about computing features for already finished trajectories, which in this formulation would mean something like `engine.add_output_type({some dictionary...})`. It would probably make sense to embed this better in the `project` object... Is that too general?
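Finding the frame-picking trajectory would then be a simple scan for the `'master'` flag. A sketch against the hypothetical dictionary layout:

```python
# a trimmed version of the proposed output description list
output_dict = [
    {'name': 'all_atom_traj', 'selection': 'all',
     'save_every_frames': 1000, 'dt': 100, 'time_unit': 'ps',
     'master': True},
    {'name': 'protein_traj', 'selection': 'protein'},
]

# the trajectory used for frame-picking is the one flagged as master
master = next(o for o in output_dict if o.get('master'))
print(master['name'])  # -> 'all_atom_traj'
```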
> Question 3: From my perspective, it makes sense to couple the different time-strides to the engine in an immutable way. If the user changes these settings, everything will break down. On the other hand, it would be good to be able to add feature trajectories in case of an adaptive strategy change. I was thinking of some kind of dictionary, maybe something like
I like that idea. What about this:
```python
engine = OpenMMEngine(...)

engine.dt  # get the integrator stepping
>>> 2 fs

engine.add_output_type(
    name='master',
    file='output.dcd',  # default would be name + '.dcd'
    save_every=1000     # or use the str `10ps`
)

engine.add_output_type(
    name='protein',
    save_every=100,     # or use the str `1ps`
    selection='protein'
)

engine.add_output_type(
    name='intramoldist',
    save_every=100,     # or use the str `1ps`
    pyemma='feature description'  # I guess that is overkill for now
)
```
Then the engine would just know what to do and what the structure of the folder is:

```
traj/
    output.dcd       # should have a simple default setting option
    protein.dcd
    intramoldist.npy
    restart.npz      # always there
```
If you have a trajectory generated by that engine, you can (already) do:

```python
t = Trajectory(...)
t.file('output.dcd')  # get the location of a file

# simpler would be
t.output('master')

# and to get all files of a certain type run
project.trajectories.all.output('protein')  # this calls output() on all trajectories
```
Looks good!
Now, how would you access the trajectory-related properties, and how is that stored? The user would probably need to extract save_every etc. for all output types or a given one. Maybe something like the mentioned dictionary as a property, like engine.output_types?
> We spoke about computing features for already finished trajectories, which in this formulation would mean something like engine.add_output_type({some dictionary...}). Probably it would make sense to embed this better in the project object... Is that too general?
A few things to consider when you change (or only add) output types along the way:

1. To not violate immutability you need to create a new engine, but that would be possible and simple. Still, you now have two engines and need to pick the right one.
2. What happens when you want to extend a trajectory? None of the previous trajectories have the new output computed. So you would either have to compute the new output for all existing trajectories, or forbid doing that with existing trajectories.
3. As you said, you could have a project function that enhances an existing engine and makes sure that everything is prepared after adding a new output.

All in all, not trivial...
> Now, how would you access the trajectory-related properties, or how is that stored? The user would probably need to extract save_every etc. for all or a given output type. Maybe something like the mentioned dictionary as a property, like engine.output_types?
Hmm, I would create a new OutputTypeDescription class which is basically like a dict, but where you can access the attributes with a dot.
From the engine, maybe something like:

```python
engine = OpenMM(...)
engine.add_output_type(...)
# as short for
engine.outputs[name] = OutputTypeDescription(...)
```

and access it in the same way:

```python
# option 1
engine.outputs['protein'].save_every

# option 2 - access via attribute if the name is python compatible
engine.protein.save_every

# option 3 - rename the file access with __getitem__
engine['protein'].save_every

# and access the files (which is really no problem)
engine.files['my_file']
# this option could also be a shortcut for option 1
```
Using the descriptor, you can also use it as an input to other generators (pyemma) without storing it again:

```python
modeller = PyEMMA(traj_descriptor=engine['protein'])
```
I think that looks pretty nice
The descriptors can even be independent of the Engine and be used (to some extent) for other engines, too. At least options like the filename or saving interval work everywhere. Selection could be difficult.
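Such a dot-accessible dict is only a few lines of code. A sketch (only the `OutputTypeDescription` name is taken from the suggestion above; the implementation details are assumptions):

```python
class OutputTypeDescription(dict):
    """A dict whose keys are also readable and writable as attributes."""

    def __getattr__(self, key):
        # called only when normal attribute lookup fails
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        # attribute assignment writes into the dict itself
        self[key] = value


d = OutputTypeDescription(name='protein', save_every=100, selection='protein')
print(d.save_every)     # -> 100
d.save_every = 200
print(d['save_every'])  # -> 200
```

Because it is still a plain dict underneath, it serializes to the DB like any other dict.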
New Trajectory Output Types
In practice, when deciding on a new feature for the analysis, the MSM needs to be built from all trajectories. Not using the trajectories that have been computed before would generally not be a good idea. But I see the problem that it is not necessarily known outside the actual MD script how these features are actually computed. Hmm. The simplest (and probably messiest...) approach that comes to my mind would be something like this (related to your bullet point 3), with a string to be executed:
```python
trajs = [trj['protein'] for trj in project.trajectories if trj.engine.name == 'old engine']
command = 'import mdtraj\nmdtraj.load(INPUT).save_xtc(OUTPUT)'  # INPUT and OUTPUT are replaced internally to be compatible with the DB
task = complete_existing_trajectories(trajs, command)
project.queue(task)
```
I'm not sure if this is a good idea, though. Since the command cannot be extracted from the engine, it is impossible to check whether anything is congruent at all. Maybe it's better if the user does this manually and there is only a function to add the generated files to the DB.
Output Type Descriptors
> Hmm, I would create a new OutputTypeDescription class which basically is like a dict, but you can access the attributes with a dot.
I like that approach, looks nice and fits with the rest.
New Trajectory Output Types
Correct, this will usually be part of a strategy. Just running PyEMMA with new features is trivial, of course: create a new PyEMMA generator and use this one instead. Whether you want to add a new reduced trajectory or a feature trajectory that is automatically used is another matter.
> Not using the trajectories that have been computed before would generally not be a good idea.
Of course, we always use what we have.
So, when you want to add new output types which produce .dcd (or whatever you are using) or feature trajectories with numpy/pyemma, you will probably have to proceed like this:

1. Create a new, enhanced engine (NEW) that has the old and new outputs and keeps the rest.
2. So far all existing trajectories are compatible with the old engine (OLD), so if you have a traj of length 100, all the outputs run with the old engine conform to that length (depending on individual strides, of course).
3. Now you can run a trajectory update that will take an existing trajectory and update it to the new engine, like `new_engine.update(trajectory)`. This will create a new trajectory with the same name but the updated engine, and will update all outputs to match the new engine. The limitation is that you can only add outputs, not remove them. That would lead to inconsistencies.
4. If you want, run the update on all trajectories or whichever ones you want. This will update your project.
5. If you want to extend a trajectory, you first have to update it.
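The "only add outputs, never remove them" limitation of the update step could be enforced with a simple superset check. A sketch (the `can_update` helper and the output-type names are hypothetical):

```python
def can_update(old_outputs, new_outputs):
    # an engine update may only add output types, never remove any,
    # otherwise existing trajectories would become inconsistent
    return set(old_outputs) <= set(new_outputs)


old = {'master', 'protein'}
new = {'master', 'protein', 'intramoldist'}

print(can_update(old, new))  # -> True  (only adds 'intramoldist')
print(can_update(new, old))  # -> False (would remove an output)
```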
This looks not too complicated, and it is concise in that a trajectory and its engine fit together.
This looks much better than what I posted above, but I don't understand how this can be handled internally. How does the engine know how to generate the output, e.g. if this is normally done by an OpenMM reporter?
> How does the engine know how to generate the output, e.g. if this is normally done by an OpenMM reporter?
Well, the engine needs to pass (somehow) the output types to openmmrun.py. This is the most annoying part: not difficult, but it requires some work.
As an example, you would need to call openmmrun.py with:

```
--outputtypes="{'master': {'file': 'output.dcd', 'stride': 10, 'selection': 'protein'}, ... }"
```
I think just JSON would be good. Maybe replace " with ' to make it look like python and still have " available for bash. This is the easiest.
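The quote-swapping trick could look like this on the sending side (a sketch; only the `--outputtypes` flag is taken from the example above, and note that values containing single quotes would break the naive swap):

```python
import json

output_types = {
    'master': {'file': 'output.dcd', 'stride': 10, 'selection': 'protein'},
}

# JSON produces double quotes; swap them to single quotes so the whole
# argument can be wrapped in double quotes on the bash command line
arg = json.dumps(output_types).replace('"', "'")
cmd = 'python openmmrun.py --outputtypes="%s"' % arg

# the receiving script reverses the swap before parsing
parsed = json.loads(arg.replace("'", '"'))
print(parsed == output_types)  # -> True
```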
Allow trajectories to be saved as a frame-reduced full-atom trajectory + an atom-reduced full-frame trajectory.
This changes a Trajectory (or its subclass) to be a group of files. These should be handled like an ordinary file, but a name change will alter all files in the same way. This needs some thought. But finally you will have a ReducedTrajectory that is a group of 3 files, and a RestartTrajectory accordingly. Or whatever combination you want. Copying etc. will work on the whole group, and I think it makes sense to use one filename + multiple extensions instead of multiple unique filenames.