wehs7661 commented 3 years ago

Description

As the final check of the documentation, in this PR I've fixed some typos and format issues and added additional explanatory texts in some of the rst files and jupyter notebooks. In the documentation, I've also indicated that the error estimates of the temperatures calculated in the non-strict kinetic distribution test will not be exactly reproducible if no random seed is set through the option bootstrap_seed. Overall, I think the documentation and the jupyter notebooks are very clear and straightforward. (There is an issue with one of the notebooks as mentioned below.) I've only found a few things that could be done to marginally improve the documentation. In the next section, I've listed some discussion points in the order of importance with the last few being the least important ones.

Discussion points

The first half (Trp-cage example) of the notebook kinetic_energy_equipartition.ipynb is not executable, which I believe the main reason is that the potential/kinetic energies are not read in such that simulation_data.observables is None and does not have the attribute kinetic_energy_per_molecule. However, the energy data is missing from trp-cage.py and I did not find any raw data of the simulation such that I could create arrays of energy and add them to trp-cage.py. I could potentially use the GROMACS parser to parse run.edr in the repo physical_validation_workshop, but in our new documentation, it was not clear which kind of thermostat was used and there are two options available in the workshop repo (be and vr), so I was not sure which set of simulation data I should use. → Resolved: Was due to an outdated version.
We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a python interface of the simulation package (maybe at least for GROMACS). This might not be immediately straightforward to the beginners of the field. We could potentially add a notebook showing different ways of reading in the simulation data and show that the results of importing the data are the same. → Decided to leave this for now, probably out of scope of our documentation.
The issue with the previous documentation was that it relied on package-specific parsers too much. I'm wondering if in the new documentation we want to encourage creating SimulationData objects from python data. If so, we might want to talk more about creating an array for the argument mass to the SystemData object. Currently, the example shown in the documentation is the 900-water example, so the creation of the array mass is straightforward. However, for a more complicated example (like a protein-ligand binding complex), we might want to mention how to efficiently create such an array other than using the GROMACS parser. However, if we are just pointing out that it is possible to create SimulationData from python data even if it could be more tedious than using a package-specific parser, then I think it's reasonable to not mention the creation of the array mass. (The creation of such an array is pretty easy anyway so this issue might be trivial though.)
For the function physical_validation.kinetic_energy.equipartition, I was wondering if it would be useful that the user is allowed to specify molec_groups with a string like LIG, SOL, or any molecule names that can be found in the GROMACS topology file. That is, if molec_groups is a string, we parse the topology file and find the corresponding atom indices. If the given string is not shown in the topology file, the function returns an error. Or do we actually want the leave this task to the users such that they should parse their topology file by themselves to get the array to be passed to the function? This is not urgent anyway.
Can GrommacsParser read in xtc files? Some people might prefer to output the simulation trajectory as an xtc file rather than a trr file due to a smaller size. Those users could convert the xtc file to a gro file trajectory since reading a gro file trajectory seems to be the other option though. → Rejected: xtc files are not helpful, because they lack velocity data. Physical validation tests either use positions and velocities, or don't need the trajectory at all.
In the section of "Simulation data", the sentence "The different physical validation tests do not all require all data to be able to run" could probably be more clear. If I understand correctly, this just means that some tests (integrator and ensemble validation) only require mdp, top, gro and edr files, while some other tests (equipartition) did not require a gro file but the trajectory file (trr file). I have not made any changes for this sentence but I think maybe we could explicitly mention the required inputs for different tests. (Maybe we can use bullet points?) → See #186
For a lot of attributes list on the page of Data Format and Parsers, there are a lot of times that a certain function or attribute are mentioned but with an italic font instead of an inline code. Do we want to term them into inline codes (i.e. add :attr: or :func: before each singly-quoted statement)? (Example: ObservableData.kinetic_energy, the kinetic energy trajectory (nframes x 1), also accessible via .ObservableData[‘kinetic_energy’]) ) → Fixed by Wei-Tse!
This is minor as well, but I just felt that when plotting probability distributions of the samples obtained in the simulation, the number of bins could be higher to show its consistency (or inconsistency) with the expected distribution more easily (especially the ones in the kinetic energy equipartition example now shown in the documentation). This might involve how flexible we want our code to be in terms of specifying the settings of the plots. → Binning has been improved since: #164

Todos

[x] Go over the discussion points and make further changes as needed.
[x] I've just found that I accidentally added r at the beginning of line 13, which leads to the failure of the linting check. I've fixed that in the next commit.
[x] Clarify the sentence "The different physical validation tests do not all require all data to be able to run" in the section of "Simulation data".

Status

[x] Ready to go

ptmerz commented 3 years ago

The first half (Trp-cage example) of the notebook kinetic_energy_equipartition.ipynb is not executable, which I believe the main reason is that the potential/kinetic energies are not read in such that simulation_data.observables is None and does not have the attribute kinetic_energy_per_molecule. However, the energy data is missing from trp-cage.py and I did not find any raw data of the simulation such that I could create arrays of energy and add them to trp-cage.py. I could potentially use the GROMACS parser to parse run.edr in the repo physical_validation_workshop, but in our new documentation, it was not clear which kind of thermostat was used and there are two options available in the workshop repo (be and vr), so I was not sure which set of simulation data I should use.

I'm pretty sure that this is due to you not using the latest master version of physical_validation to run the notebook, because this looks like a bug I fixed in the last weeks. Can you verify this?

ptmerz commented 3 years ago

We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a python interface of the simulation package (maybe at least for GROMACS). This might not be immediately straightforward to the beginners of the field. We could potentially add a notebook showing different ways of reading in the simulation data and show that the results of importing the data are the same.

The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced on top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?

ptmerz commented 3 years ago

The issue with the previous documentation was that it relied on package-specific parsers too much. I'm wondering if in the new documentation we want to encourage creating SimulationData objects from python data. If so, we might want to talk more about creating an array for the argument mass to the SystemData object. Currently, the example shown in the documentation is the 900-water example, so the creation of the array mass is straightforward. However, for a more complicated example (like a protein-ligand binding complex), we might want to mention how to efficiently create such an array other than using the GROMACS parser. However, if we are just pointing out that it is possible to create SimulationData from python data even if it could be more tedious than using a package-specific parser, then I think it's reasonable to not mention the creation of the array mass. (The creation of such an array is pretty easy anyway so this issue might be trivial though.)

That's a good point. Maybe we could just add a slightly more complex example right below? Like "Note that numpy offers a few nice tricks to create the mass vector efficiently. For example, for a protein-ligand binding complex, one could do [...]"?

ptmerz commented 3 years ago

For the function physical_validation.kinetic_energy.equipartition, I was wondering if it would be useful that the user is allowed to specify molec_groups with a string like LIG, SOL, or any molecule names that can be found in the GROMACS topology file. That is, if molec_groups is a string, we parse the topology file and find the corresponding atom indices. If the given string is not shown in the topology file, the function returns an error. Or do we actually want the leave this task to the users such that they should parse their topology file by themselves to get the array to be passed to the function? This is not urgent anyway.

That's a great suggestion, but also a prime example of why I think we should move away from parsers. It would be a non-trivial job to implement this, and then we'd likely still miss a few corner cases, and whenever GROMACS decides to change something we need to make sure to follow (and probably support both the old and the new version because adoption of new formats takes time). And then we should ideally be able to offer the same for LAMMPS, OpenMM, HOOMD, AMBER, and so on...

What would be great is if we could use MDTraj (or a similar package) to handle that. I think MDTraj might have the required knowledge of the topology, but unfortunately (as far as I know) still doesn't allow easy access to the velocities. But if you or someone else has some experience with MDTraj and could contribute an example that would be super useful. Otherwise I'd suggest to keep that as a goal for a future version.

ptmerz commented 3 years ago

Can GrommacsParser read in xtc files? Some people might prefer to output the simulation trajectory as an xtc file rather than a trr file due to a smaller size. Those users could convert the xtc file to a gro file trajectory since reading a gro file trajectory seems to be the other option though.

No, and it's probably not worth to add because xtc doesn't store velocities. The tests need either both the positions and the velocities (for the equipartition tests) or none of the two.

ptmerz commented 3 years ago

In the section of "Simulation data", the sentence "The different physical validation tests do not all require all data to be able to run" could probably be more clear. If I understand correctly, this just means that some tests (integrator and ensemble validation) only require mdp, top, gro and edr files, while some other tests (equipartition) did not require a gro file but the trajectory file (trr file). I have not made any changes for this sentence but I think maybe we could explicitly mention the required inputs for different tests. (Maybe we can use bullet points?)

Agreed that it can be clarified. I'll take a pass at that.

ptmerz commented 3 years ago

For a lot of attributes list on the page of Data Format and Parsers, there are a lot of times that a certain function or attribute are mentioned but with an italic font instead of an inline code. Do we want to term them into inline codes (i.e. add :attr: or :func: before each singly-quoted statement)? (Example: ObservableData.kinetic_energy, the kinetic energy trajectory (nframes x 1), also accessible via .ObservableData[‘kinetic_energy’]) )

Good point. Do you want to take a shot at that?

ptmerz commented 3 years ago

This is minor as well, but I just felt that when plotting probability distributions of the samples obtained in the simulation, the number of bins could be higher to show its consistency (or inconsistency) with the expected distribution more easily (especially the ones in the kinetic energy equipartition example now shown in the documentation). This might involve how flexible we want our code to be in terms of specifying the settings of the plots.

Agreed, I'm looking into that. I'll make this a separate PR if I find a satisfactory solution. Note that all the plotting is meant as a visual help, not necessarily as publication-worthy figures - it's very hard to find settings which look good under all circumstances. Do you think that we should emphasize that somewhere?

ptmerz commented 3 years ago

Thank you @wehs7661, this is extremely helpful!!! I tried to answer to the single points in separate comments above so we can keep an overview. Also, readthedocs is creating the documentation for PRs (this one is at https://physical-validation--163.org.readthedocs.build/en/163/index.html). It should post to the PR as well, I'm not sure why that doesn't work, will investigate.

ptmerz commented 3 years ago

Finally, I noticed that changes in the notebooks are extremely difficult to review. Maybe we should create an environment (maybe a docker image?) which we use to run the notebooks. That way the differences between versions would only be due to changes we make, instead of being a mix of our changes and our different environments.

(We could also upload the notebooks without running them and have them running on creation of the documentation, but for that we would need to understand if we'd run into some resource problems (some of the examples take quite a long time and some memory to run). Also, I like the fact that we can look at the notebook on Github and have some output right there.)

Until we've figured that out, could you give a quick overview of what you changed in them?

wehs7661 commented 3 years ago

I'm pretty sure that this is due to you not using the latest master version of physical_validation to run the notebook, because this looks like a bug I fixed in the last weeks. Can you verify this?

Yes, I've updated my package locally and it was the version issue. Now the notebook is executable!

wehs7661 commented 3 years ago

The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced on top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?

I think maybe mentioning exactly the way you got the data pasted in any file in the folder simulation_results would be a reasonable starting point?

wehs7661 commented 3 years ago

That's a good point. Maybe we could just add a slightly more complex example right below? Like "Note that numpy offers a few nice tricks to create the mass vector efficiently. For example, for a protein-ligand binding complex, one could do [...]"?

Yes, I agree. With MDAnalysis, it should be as straightforward as below:

import MDAnalysis as mda
import numpy as np 

u = mda.Universe('system.gro')
mass=np.array([u.atoms[i].mass for i in range(len(u.atoms))])

For a complex system where the users might only want to get the mass array of a certain molecule in the system, they could just adjust the range of the i value in the list comprehension. I'll add the above example in the documentation if you think this makes sense.

wehs7661 commented 3 years ago

What would be great is if we could use MDTraj (or a similar package) to handle that.

That's probably the easier route. I'm not sure about MDTraj, but with MDAnalysis we could do the following to create molec_groups given a gro and a tpr file:

import MDAnalysis as mda
import numpy as np 

u = mda.Universe('system.tpr', 'system.gro')
molec_groups = []
for i in range(len(u.segments)):
    seg = u.segments[i]
    molec_groups.append(np.array([seg.atoms[j].index for j in range(len(seg.atoms))]))

The tpr file was used so that the molecule groups defined in the GROMACS topology file are considered. If the users want to specify groups not defined in the topology file, they could consider using Universe.add_Segment as exemplified on this page.

For the code above and the one for creating the parameter mass for SystemData, I've validated their correctness using my own GROMACS simulation files. Do we want to provide the files as an working example or maybe we could just provide the code in the documentation as long as we are sure they work?

Also, I can add the code above to the docstring of physical_validation.kinetic_energy.equipartition as needed.

wehs7661 commented 3 years ago

Agreed that it can be clarified. I'll take a pass at that.

Okay. I've just added that as a to-do so that we remember to do so.

wehs7661 commented 3 years ago

Good point. Do you want to take a shot at that?

The issue has been fixed. Will push in the next commit.

wehs7661 commented 3 years ago

Do you think that we should emphasize that somewhere?

I think that could be helpful. Also, I think maybe we could additionally output the data required for plotting so then we don't have to worry about the settings of the plots anymore. The user could just use those outputs to plot the figure with whatever settings they prefer.

wehs7661 commented 3 years ago

Until we've figured that out, could you give a quick overview of what you changed in them?

I'm not familiar with docker images, but here are the diffs of the notebooks (except for different results produced due to different versions of the package or lack of a random seed): doc/examples/ensemble_check.ipynb:

Before: The results imported here are results from example simulations which are stored in another Python file.
After: The results imported here are results from example simulations, which are the time series of kinetic and potential energy stored in another Python file.

doc/examples/integrator_validation.ipynb:

Before: The results imported here are results from example simulations which are stored in another Python file.
After: The results imported here are results from example simulations, which are the time series of kinetic and potential energy stored in another Python file.

doc/examples/kinetic_energy_distribution.ipynb:

Before: The results imported here are results from example simulations which are stored in another Python file.
After: The results imported here are results from example simulations, which are the time series of kinetic and potential energy stored in another Python file.
Before: Note that any deviation within about 2-3 standard deviations is usually considered as statistically insignificant.
After: Note that the errors of the temperatures were computed using bootstrapping, which used random sampling with replacement to mimic the trajectory data. That is, the error estimates of the temperature (hence the returned tuple) could be slightly different in different trials unless one set a random seed (using the bootstrap_seed option). Any deviation within about 2-3 standard deviations is usually considered as statistically insignificant.
Before: This means that especially for simulations which rely on correct sampling of the tail regions, the Berendsen temperature coupling is likely to introduce artefacts.
After: This means that especially for simulations which rely on correct sampling of the tail regions, the Berendsen temperature coupling is likely to introduce artefacts. Again, the tuple returned here could be slightly different in different trials since no random seed was set.

doc/examples/kinetic_energy_equipartition.ipynb:

Before: The results imported here are results from example simulations which are stored in another Python file.
After: The results imported here are results from example simulations, which are the time series of the kinetic/potential energy, position and velocity stored in another Python file.
Before: To demonstrate this checks, we will first look at a simulation of a Trp-cage mini-protein.
After: To demonstrate this check, we will first look at a simulation of a Trp-cage mini-protein.

ptmerz commented 3 years ago

We might want to show how the trajectory data (kinetic energy, potential energy, and total energy) can be obtained from a python interface of the simulation package (maybe at least for GROMACS). This might not be immediately straightforward to the beginners of the field. We could potentially add a notebook showing different ways of reading in the simulation data and show that the results of importing the data are the same.

The idea was that https://physical-validation.readthedocs.io/en/latest/simulation_data.html, which is referenced on top of every notebook, would be exactly that. Do you have some suggestions on how to improve this?

I think maybe mentioning exactly the way you got the data pasted in any file in the folder simulation_results would be a reasonable starting point?

No, this doesn't really work. The examples should substitute a real interface, how they were created is irrelevant. GROMACS does not have a Python interface, that's why we have a parser. We should add examples for other packages, though, that do have a Python interface.

codecov[bot] commented 3 years ago

Codecov Report

Merging #163 (f04d186) into master (d79b35d) will not change coverage. The diff coverage is n/a.

ptmerz commented 3 years ago

I did a rebase and tried to remove as much noise as possible from the ipynb to allow for review.

ptmerz commented 3 years ago

I implemented @wehs7661 remaining suggestions in a commit. This should be good to go now. There will be a follow-up to clean up some references which are out-of-date, but I think this PR has been large enough also like this :)

shirtsgroup / physical_validation

Final check of the documentation #163

Description

Discussion points

Todos

Status

Codecov Report