The first half (Trp-cage example) of the notebook `kinetic_energy_equipartition.ipynb` is not executable. I believe the main reason is that the potential/kinetic energies are not read in, so that `simulation_data.observables` is `None` and does not have the attribute `kinetic_energy_per_molecule`. However, the energy data is missing from `trp-cage.py`, and I did not find any raw simulation data from which I could create the energy arrays and add them to `trp-cage.py`. I could potentially use the GROMACS parser to parse `run.edr` in the repo `physical_validation_workshop`, but our new documentation does not make clear which kind of thermostat was used, and there are two options available in the workshop repo (be and vr), so I was not sure which set of simulation data I should use.
I'm pretty sure that this is due to you not using the latest master version of `physical_validation` to run the notebook, because this looks like a bug I fixed in the last few weeks. Can you verify this?
We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a Python interface of the simulation package (maybe at least for GROMACS). This might not be immediately obvious to beginners in the field. We could potentially add a notebook showing different ways of reading in the simulation data and demonstrating that the results of importing the data are the same.
The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced at the top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?
The issue with the previous documentation was that it relied too much on package-specific parsers. I'm wondering whether, in the new documentation, we want to encourage creating `SimulationData` objects from Python data. If so, we might want to say more about creating an array for the argument `mass` of the `SystemData` object. Currently, the example shown in the documentation is the 900-water example, where the creation of the `mass` array is straightforward. For a more complicated system (like a protein-ligand binding complex), however, we might want to mention how to create such an array efficiently without resorting to the GROMACS parser. On the other hand, if we are just pointing out that it is possible to create `SimulationData` from Python data, even if that can be more tedious than using a package-specific parser, then I think it's reasonable not to mention the creation of the `mass` array. (Creating such an array is pretty easy anyway, so this issue might be trivial.)
That's a good point. Maybe we could just add a slightly more complex example right below? Like "Note that numpy offers a few nice tricks to create the mass vector efficiently. For example, for a protein-ligand binding complex, one could do [...]"?
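For illustration, a minimal sketch of what such a numpy-based example could look like (the file names, per-atom masses, and molecule counts below are placeholders, not taken from any of our examples):

```python
import numpy as np

# Hypothetical protein-ligand complex in water: per-atom masses of the protein
# and the ligand are read from placeholder files, the waters (O, H, H) are tiled.
protein_masses = np.loadtxt('protein_masses.dat')
ligand_masses = np.loadtxt('ligand_masses.dat')
water_masses = np.tile([15.9994, 1.008, 1.008], 5000)  # 5000 water molecules

# Concatenate in the same atom order as in the trajectory
mass = np.concatenate([protein_masses, ligand_masses, water_masses])
```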
For the function `physical_validation.kinetic_energy.equipartition`, I was wondering whether it would be useful to allow the user to specify `molec_groups` with a string like `LIG`, `SOL`, or any molecule name that can be found in the GROMACS topology file. That is, if `molec_groups` is a string, we parse the topology file and find the corresponding atom indices. If the given string is not found in the topology file, the function returns an error. Or do we actually want to leave this task to the users, so that they parse their topology file themselves to get the array to be passed to the function? This is not urgent anyway.
That's a great suggestion, but also a prime example of why I think we should move away from parsers. It would be a non-trivial job to implement this, and then we'd likely still miss a few corner cases, and whenever GROMACS decides to change something we need to make sure to follow (and probably support both the old and the new version because adoption of new formats takes time). And then we should ideally be able to offer the same for LAMMPS, OpenMM, HOOMD, AMBER, and so on...
What would be great is if we could use MDTraj (or a similar package) to handle that. I think MDTraj might have the required knowledge of the topology, but unfortunately (as far as I know) it still doesn't allow easy access to the velocities. But if you or someone else has some experience with MDTraj and could contribute an example, that would be super useful. Otherwise I'd suggest keeping that as a goal for a future version.
Can `GromacsParser` read in xtc files? Some people might prefer to output the simulation trajectory as an xtc file rather than a trr file because of the smaller size. Those users could convert the xtc file to a gro trajectory, though, since reading a gro trajectory seems to be the other option.
No, and it's probably not worth adding, because xtc doesn't store velocities. The tests need either both the positions and the velocities (for the equipartition tests) or neither of the two.
In the "Simulation data" section, the sentence "The different physical validation tests do not all require all data to be able to run" could probably be clearer. If I understand correctly, this just means that some tests (integrator and ensemble validation) only require mdp, top, gro and edr files, while other tests (equipartition) do not require a gro file but rather a trajectory file (trr). I have not changed this sentence, but I think maybe we could explicitly mention the required inputs for the different tests. (Maybe we can use bullet points?)
Agreed that it can be clarified. I'll take a pass at that.
For a lot of the attributes listed on the Data Format and Parsers page, a given function or attribute is mentioned in an italic font instead of as inline code. Do we want to turn them into inline code (i.e. add `:attr:` or `:func:` before each singly-quoted statement)? (Example: `ObservableData.kinetic_energy`, the kinetic energy trajectory (nframes x 1), also accessible via `ObservableData['kinetic_energy']`.)
Good point. Do you want to take a shot at that?
This is minor as well, but I just felt that when plotting probability distributions of the samples obtained in the simulation, the number of bins could be higher to show their consistency (or inconsistency) with the expected distribution more easily (especially the ones in the kinetic energy equipartition example now shown in the documentation). This might involve how flexible we want our code to be in terms of specifying the plot settings.
Agreed, I'm looking into that. I'll make this a separate PR if I find a satisfactory solution. Note that all the plotting is meant as a visual help, not necessarily as publication-worthy figures - it's very hard to find settings which look good under all circumstances. Do you think that we should emphasize that somewhere?
Thank you @wehs7661, this is extremely helpful!!! I tried to answer the individual points in separate comments above so we can keep an overview. Also, readthedocs is creating the documentation for PRs (this one is at https://physical-validation--163.org.readthedocs.build/en/163/index.html). It should post to the PR as well; I'm not sure why that doesn't work, will investigate.
Finally, I noticed that changes in the notebooks are extremely difficult to review. Maybe we should create an environment (maybe a docker image?) which we use to run the notebooks. That way the differences between versions would only be due to changes we make, instead of being a mix of our changes and our different environments.
(We could also upload the notebooks without running them and have them running on creation of the documentation, but for that we would need to understand if we'd run into some resource problems (some of the examples take quite a long time and some memory to run). Also, I like the fact that we can look at the notebook on Github and have some output right there.)
Until we've figured that out, could you give a quick overview of what you changed in them?
I'm pretty sure that this is due to you not using the latest master version of `physical_validation` to run the notebook, because this looks like a bug I fixed in the last few weeks. Can you verify this?
Yes, I've updated my package locally and it was the version issue. Now the notebook is executable!
The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced at the top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?
I think mentioning exactly how you obtained the data pasted in the files in the folder `simulation_results` would be a reasonable starting point?
That's a good point. Maybe we could just add a slightly more complex example right below? Like "Note that numpy offers a few nice tricks to create the mass vector efficiently. For example, for a protein-ligand binding complex, one could do [...]"?
Yes, I agree. With MDAnalysis, it should be as straightforward as below:
```python
import MDAnalysis as mda
import numpy as np

# Load the structure and collect the per-atom masses into a numpy array
u = mda.Universe('system.gro')
mass = np.array([u.atoms[i].mass for i in range(len(u.atoms))])
```
For a complex system where the user might want only the mass array of a certain molecule in the system, they could just adjust the range of `i` in the list comprehension. I'll add the above example to the documentation if you think this makes sense.
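As a variation on the above, one could also use an atom selection instead of adjusting the index range; a minimal sketch, assuming the ligand residue is named LIG (a placeholder name):

```python
import MDAnalysis as mda

u = mda.Universe('system.gro')
# Restrict the mass array to a single molecule via a selection
# (the residue name 'LIG' is a placeholder)
ligand = u.select_atoms('resname LIG')
ligand_mass = ligand.masses  # per-atom masses of the selection as a numpy array
```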
What would be great is if we could use MDTraj (or a similar package) to handle that.
That's probably the easier route. I'm not sure about MDTraj, but with MDAnalysis we could do the following to create `molec_groups` given a gro and a tpr file:
```python
import MDAnalysis as mda
import numpy as np

# Use the tpr file so that the molecule groups defined in the GROMACS topology
# are available as segments
u = mda.Universe('system.tpr', 'system.gro')
molec_groups = []
for i in range(len(u.segments)):
    seg = u.segments[i]
    molec_groups.append(np.array([seg.atoms[j].index for j in range(len(seg.atoms))]))
```
The tpr file is used so that the molecule groups defined in the GROMACS topology file are taken into account. If users want to specify groups not defined in the topology file, they could consider using `Universe.add_Segment`, as exemplified on this page.
For the code above and the one creating the parameter `mass` for `SystemData`, I've validated their correctness using my own GROMACS simulation files. Do we want to provide the files as a working example, or could we just provide the code in the documentation as long as we are sure it works? Also, I can add the code above to the docstring of `physical_validation.kinetic_energy.equipartition` as needed.
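For completeness, a sketch of how the resulting `molec_groups` could then be passed to the equipartition test (the `simulation_data` object is assumed to have been created beforehand, e.g. with the GROMACS parser or from Python data; file names are placeholders):

```python
import MDAnalysis as mda
import physical_validation as pv

# Build the per-molecule atom-index groups, one array per segment
u = mda.Universe('system.tpr', 'system.gro')
molec_groups = [seg.atoms.indices for seg in u.segments]

# `simulation_data` is assumed to be an existing pv.data.SimulationData object
simulation_data = ...

result = pv.kinetic_energy.equipartition(simulation_data, molec_groups=molec_groups)
```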
Agreed that it can be clarified. I'll take a pass at that.
Okay. I've just added that as a to-do so that we remember to do so.
Good point. Do you want to take a shot at that?
The issue has been fixed. Will push in the next commit.
Do you think that we should emphasize that somewhere?
I think that could be helpful. Also, I think maybe we could additionally output the data required for plotting, so that we don't have to worry about the plot settings anymore. The user could just use those outputs to plot the figure with whatever settings they prefer.
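As an illustration of what users could then do themselves, a minimal matplotlib sketch (the energy file and the bin count are placeholders; the kinetic energy trajectory could, for example, come from `simulation_data.observables.kinetic_energy`):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder: per-frame kinetic energies exported by the test or read from the data object
kinetic_energy = np.loadtxt('kinetic_energy.dat')

# The user is free to choose the binning and styling
plt.hist(kinetic_energy, bins=100, density=True, histtype='step')
plt.xlabel('Kinetic energy')
plt.ylabel('Probability density')
plt.savefig('kinetic_energy_distribution.png')
```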
Until we've figured that out, could you give a quick overview of what you changed in them?
I'm not familiar with docker images, but here are the diffs of the notebooks (except for different results produced due to different versions of the package or lack of a random seed):
- `doc/examples/ensemble_check.ipynb`
- `doc/examples/integrator_validation.ipynb`
- `doc/examples/kinetic_energy_distribution.ipynb`
- `doc/examples/kinetic_energy_equipartition.ipynb`

[per-notebook diffs not reproduced here]
We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a Python interface of the simulation package (maybe at least for GROMACS). This might not be immediately obvious to beginners in the field. We could potentially add a notebook showing different ways of reading in the simulation data and demonstrating that the results of importing the data are the same.
The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced at the top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?
I think mentioning exactly how you obtained the data pasted in the files in the folder `simulation_results` would be a reasonable starting point?
No, this doesn't really work. The examples should substitute for a real interface; how they were created is irrelevant. GROMACS does not have a Python interface, which is why we have a parser. We should, however, add examples for other packages that do have a Python interface.
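To give an idea of what such an example could look like for a package with a Python interface, here is a rough sketch for OpenMM (this is not part of the current documentation; the `simulation` object, frame count, and step interval are placeholders, and the `openmm` module name assumes a recent OpenMM version):

```python
import openmm.unit as unit

# `simulation` is assumed to be an existing openmm.app.Simulation object
simulation = ...

# Collect per-frame kinetic and potential energies during the run
kinetic, potential = [], []
for _ in range(1000):        # number of frames (placeholder)
    simulation.step(500)     # steps between samples (placeholder)
    state = simulation.context.getState(getEnergy=True)
    kinetic.append(state.getKineticEnergy().value_in_unit(unit.kilojoule_per_mole))
    potential.append(state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole))
```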
I did a rebase and tried to remove as much noise as possible from the ipynb to allow for review.
I implemented @wehs7661's remaining suggestions in a commit. This should be good to go now. There will be a follow-up to clean up some references which are out of date, but I think this PR has been large enough as it is :)
Description
As the final check of the documentation, in this PR I've fixed some typos and formatting issues and added additional explanatory text in some of the `rst` files and Jupyter notebooks. In the documentation, I've also indicated that the error estimates of the temperatures calculated in the non-strict kinetic distribution test will not be exactly reproducible if no random seed is set through the option `bootstrap_seed`. Overall, I think the documentation and the Jupyter notebooks are very clear and straightforward. (There is an issue with one of the notebooks, as mentioned below.) I've only found a few things that could be done to marginally improve the documentation. In the next section, I've listed some discussion points in order of importance, with the last few being the least important.
Discussion points
- The first half (Trp-cage example) of the notebook `kinetic_energy_equipartition.ipynb` is not executable. I believe the main reason is that the potential/kinetic energies are not read in, so that `simulation_data.observables` is `None` and does not have the attribute `kinetic_energy_per_molecule`. However, the energy data is missing from `trp-cage.py`, and I did not find any raw simulation data from which I could create the energy arrays and add them to `trp-cage.py`. I could potentially use the GROMACS parser to parse `run.edr` in the repo `physical_validation_workshop`, but our new documentation does not make clear which kind of thermostat was used, and there are two options available in the workshop repo (be and vr), so I was not sure which set of simulation data I should use.
  → Resolved: Was due to an outdated version.
- We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a Python interface of the simulation package (maybe at least for GROMACS). This might not be immediately obvious to beginners in the field. We could potentially add a notebook showing different ways of reading in the simulation data and demonstrating that the results of importing the data are the same.
  → Decided to leave this for now, probably out of scope of our documentation.
- The issue with the previous documentation was that it relied too much on package-specific parsers. I'm wondering whether, in the new documentation, we want to encourage creating `SimulationData` objects from Python data. If so, we might want to say more about creating an array for the argument `mass` of the `SystemData` object. Currently, the example shown in the documentation is the 900-water example, where the creation of the `mass` array is straightforward. For a more complicated system (like a protein-ligand binding complex), however, we might want to mention how to create such an array efficiently without resorting to the GROMACS parser. On the other hand, if we are just pointing out that it is possible to create `SimulationData` from Python data, even if that can be more tedious than using a package-specific parser, then I think it's reasonable not to mention the creation of the `mass` array. (Creating such an array is pretty easy anyway, so this issue might be trivial.)
- For the function `physical_validation.kinetic_energy.equipartition`, I was wondering whether it would be useful to allow the user to specify `molec_groups` with a string like `LIG`, `SOL`, or any molecule name that can be found in the GROMACS topology file. That is, if `molec_groups` is a string, we parse the topology file and find the corresponding atom indices. If the given string is not found in the topology file, the function returns an error. Or do we actually want to leave this task to the users, so that they parse their topology file themselves to get the array to be passed to the function? This is not urgent anyway.
- Can `GromacsParser` read in xtc files? Some people might prefer to output the simulation trajectory as an xtc file rather than a trr file because of the smaller size. Those users could convert the xtc file to a gro trajectory, though, since reading a gro trajectory seems to be the other option.
  → Rejected: xtc files are not helpful, because they lack velocity data. Physical validation tests either use positions and velocities, or don't need the trajectory at all.
- In the "Simulation data" section, the sentence "The different physical validation tests do not all require all data to be able to run" could probably be clearer. If I understand correctly, this just means that some tests (integrator and ensemble validation) only require mdp, top, gro and edr files, while other tests (equipartition) do not require a gro file but rather a trajectory file (trr). I have not changed this sentence, but I think maybe we could explicitly mention the required inputs for the different tests. (Maybe we can use bullet points?)
  → See #186
- For a lot of the attributes listed on the Data Format and Parsers page, a given function or attribute is mentioned in an italic font instead of as inline code. Do we want to turn them into inline code (i.e. add `:attr:` or `:func:` before each singly-quoted statement)? (Example: `ObservableData.kinetic_energy`, the kinetic energy trajectory (nframes x 1), also accessible via `ObservableData['kinetic_energy']`.)
  → Fixed by Wei-Tse!
- This is minor as well, but I just felt that when plotting probability distributions of the samples obtained in the simulation, the number of bins could be higher to show their consistency (or inconsistency) with the expected distribution more easily (especially the ones in the kinetic energy equipartition example now shown in the documentation). This might involve how flexible we want our code to be in terms of specifying the plot settings.
  → Binning has been improved since: #164
Todos
- … `r` at the beginning of line 13, which leads to the failure of the linting check. I've fixed that in the next commit.
Status