openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
721 stars 38 forks source link

[REVIEW]: pygen-structures: A Python package to generate 3D molecular structures for simulations using the CHARMM forcefield #2157

Closed whedon closed 4 years ago

whedon commented 4 years ago

Submitting author: @thesketh (Travis Paul Hesketh) Repository: https://github.com/thesketh/pygen-structures Version: v0.2.4 Editor: @richardjgowers Reviewer: @dotsdl, @rmeli, @amandadumi Archive: 10.5281/zenodo.3739128

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/57dce0c14dd34c111f89077cd367dbd7"><img src="https://joss.theoj.org/papers/57dce0c14dd34c111f89077cd367dbd7/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/57dce0c14dd34c111f89077cd367dbd7/status.svg)](https://joss.theoj.org/papers/57dce0c14dd34c111f89077cd367dbd7)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@dotsdl & @rmeli & @amandadumi, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @richardjgowers know.

Please try and complete your review in the next two weeks

Review checklist for @dotsdl

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @rmeli

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @amandadumi

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @dotsdl, @rmeli, @amandadumi it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 4 years ago

PDF failed to compile for issue #2157 with the following error:

Error reading bibliography ./paper.bib (line 75, column 3): unexpected "d" expecting space, ",", white space or "}" Error running filter pandoc-citeproc: Filter returned error status 1 Looks like we failed to compile the PDF

richardjgowers commented 4 years ago

@thesketh looks like three's an issue with the bibtex

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

PDF failed to compile for issue #2157 with the following error:

Error reading bibliography ./paper.bib (line 75, column 3): unexpected "d" expecting space, ",", white space or "}" Error running filter pandoc-citeproc: Filter returned error status 1 Looks like we failed to compile the PDF

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

RMeli commented 4 years ago

I had a first look at the package and have a few comments, none of which I deemed worth of its own issue for now (besides, some issues were already open). I will have a deeper look in the upcoming days, but I thought it might be useful to have some initial comments earlier.

(@richardjgowers if you prefer to see the comments in bulk do let me know and I'll adjust for next time)

Functionality

Installation

Installation is easy and described in the README. A few comments:

Documentation

Installation

See comments above.

Detailed installation instructions could be added to the main documentation as well (the website is already in place but only the API documentation is present).

Automated Tests

Automated tests can be easily run with pytest and all pass. However, there is no test coverage displayed. Using pytest-cov I can see that some modules have rather low coverage (convenience_functions.py is at 41%). I think the test coverage metric should be added and improved for some modules.

(See also thesketh/pygen-structures#3)

Functionality Documentation

Documentation for the objects returned by different functions is often missing.

Example Usage

There is a single usage example. I think more usage examples should be added.

It would be useful to describe the different input in more details within the documentation with the help of usage examples.

(See also thesketh/pygen-structures#2; as noted there, the documentation skeleton is already in place and should be relatively straightforward to add an "Usage Examples" section)

Misc

I noticed that the CHARMM parameter files are included under pygen_structures/toppar. Is there an associated license for such files?

richardjgowers commented 4 years ago

Hey Rocco, thanks for the feedback, I think these points are probably good to be raised as issues too

On Tue, Mar 10, 2020 at 22:34, Rocco Meli notifications@github.com wrote:

I had a first look at the package and have a few comments, none of which I deemed worth of its own issue for now (besides, some issues were already open). I will have a deeper look in the upcoming days, but I thought it might be useful to have some initial comments earlier.

(@richardjgowers https://github.com/richardjgowers if you prefer to see the comments in bulk do let me know and I'll adjust for next time) Functionality Installation

Installation is easy and described in the README. A few comments:

  • Since RDKit can only be installed via conda I would remove the requirements.txt file
  • I would add a second .yml defining a full environment for testing (so that the user can avoid searching for openmm installation instructions)
  • Adding a dedicated "Installation" section to the README could be beneficial

Documentation Installation

See comments above.

Detailed installation instructions could be added to the main documentation as well (the website is already in place but only the API documentation is present). Automated Tests

Automated tests can be easily run with pytest and all pass. However, there is no test coverage displayed. Using pytest-cov I can see that some modules have rather low coverage (convenience_functions.py is at 41%). I think the test coverage metric should be added and improved for some modules.

(See also thesketh/pygen-structures#3 https://github.com/thesketh/pygen-structures/issues/3) Functionality Documentation

Documentation for the objects returned by different functions is often missing. Example Usage

There is a single usage example. I think more usage examples should be added.

It would be useful to describe the different input in more details within the documentation with the help of usage examples.

(See also thesketh/pygen-structures#2 https://github.com/thesketh/pygen-structures/issues/2; as noted there, the documentation skeleton is already in place and should be relativelystraightforward to add an "Usage Examples" section) Misc

I noticed that the CHARMM parameter files are included under pygen_structures/toppar. Is there an associated license for such files?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2157?email_source=notifications&email_token=ACGSGBZHAHOCB5EP5JVL4E3RG2565A5CNFSM4LFDODLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEONOIBA#issuecomment-597353476, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4HGDUH3T6QDFDL2OTRG2565ANCNFSM4LFDODLA .

thesketh commented 4 years ago

Thanks for the preliminary comments, Rocco. I'll address these as soon as I can. I think I can fix the poor coverage of convenience_functions by comparing code_to_mol and sequence_to_mol with a manually created Molecule.

The requirements.txt file is necessary for PyPI distribution. I could remove RDKit from the requirements and catch the ImportError at runtime to produce a more helpful error, but I thought this was a dirtier solution.

The toppar distribution is in the public domain, but I'll add another README in this folder explaining where the files are from, where new versions can be obtained, and what changes have been made from the standard distribution (I've noticed some issues with the D-amino acid parameters, which are fixed in the 0.2.3 branch).

Documentation for return types is missing in some cases because from __future__ import annotations is unavailable in Python 3.6, but I'll add :rtype: tags where this is the case.

RMeli commented 4 years ago

Thanks for clarifying the situation with toppar; the lack of license on their end is a bit strange... In the light of the forum discussion you linked, what you suggest to do sounds very reasonable.

I'm not well versed with PyPI distribution, but I have a package on PyPI without any requirements.txt. I think PyPI deployment only uses install_requires in setup.py (see here):

install_requires should be used to specify what dependencies a project minimally needs to run. When the project is installed by pip, this is the specification that is used to install its dependencies.

If I pip-install pygen_structures on a clean environment, the package is installed but neither numpy nor rdkit are (since install_requires is not specified). I think you can remove requirements.txt, add numpy to install_requires and leave rdkit installation to the user as described in the documentation (I think the error ModuleNotFoundError: No module named 'rdkit' is clear enough). Other reviewers might have different opinions and are probably more experienced with PyPI deployment, so I'd wait for their comments before making any change.

thesketh commented 4 years ago

Thanks for the tip about the install_requires, it seems that there's even a page on "install_requires vs requirements files". I was under the impression that requirements.txt was necessary. I think you're probably right and that deleting it is the right thing to do, but I'll hold off for now.

I think I've addressed most of your initial comments now (with the extra information added to README), but I still need to add the information to the main documentation.

arfon commented 4 years ago

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team.

thesketh commented 4 years ago

Thank you for your comment, @arfon . It is refreshing to see a journal which cares about its authors, editorial staff, and reviewers.

In Scotland, we are relatively unaffected (as of yet) but I implore the editor and the reviewers to take the time they need in light of the current crisis. This submission can wait, your health can not. If you have comments to make, they are appreciated, but please focus on your health first and foremost.

amandadumi commented 4 years ago

I am still making my way through the code, but I had another thought about the handling of the additional testing dependency. I agree that avoiding having users search for openmm installation instructions would be good. @RMeli provided this solution, which was nice!

* I would add a second `.yml` defining a full environment for testing (so that the user can avoid searching for `openmm` installation instructions)

Another way that this can be done is to define these as extras in the setup.py as described here. I just wanted to mention this since this would avoid an extra file and could then be installed from pypi as described here. What do you think, @RMeli? I haven't dealt with this first hand so I'm not sure if your approach has a benefit I am overlooking.

RMeli commented 4 years ago

@amandadumi I think that would be a good idea, but it seems to me that in order to install OpenMM conda is needed (with the -c omnia -c conda-forge channels).

richardjgowers commented 4 years ago

Iirc there was an effort to get a minimal openmm working on conda-forge... the hang up was cuda dependencies

On Sun, Mar 15, 2020 at 20:03, Rocco Meli notifications@github.com wrote:

@amandadumi https://github.com/amandadumi I think that would be a good idea, but it seems to me that in order to install OpenMM http://docs.openmm.org/latest/userguide/application.html#installing-openmm conda is needed (with the -c omnia -c conda-forge channels).

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2157#issuecomment-599257354, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4X4PQU3WJD7KE5PLDRHUYCDANCNFSM4LFDODLA .

richardjgowers commented 4 years ago

By @jaimergp

On Sun, Mar 15, 2020 at 20:06, Richard Gowers richardjgowers@gmail.com wrote:

Iirc there was an effort to get a minimal openmm working on conda-forge... the hang up was cuda dependencies

On Sun, Mar 15, 2020 at 20:03, Rocco Meli notifications@github.com wrote:

@amandadumi https://github.com/amandadumi I think that would be a good idea, but it seems to me that in order to install OpenMM http://docs.openmm.org/latest/userguide/application.html#installing-openmm conda is needed (with the -c omnia -c conda-forge channels).

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2157#issuecomment-599257354, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4X4PQU3WJD7KE5PLDRHUYCDANCNFSM4LFDODLA .

RMeli commented 4 years ago

@thesketh I didn't have enough time to dig deep into the code, but I have some comments about the paper below.

Software Paper

Statement of Need

A big chunk of Fig. 1 details "planned future work". This gives the impression that pygen-structures is incomplete and there is still a lot of work to do. I'm wondering if it would be better to focus on what the program does instead of outlining a lot of planned things that could be far away in the future. Generating coordinates for missing residues is a big project in itself and you already underline this limitation in the conclusions. @richardjgowers might have some advice on this.

Summary

I'm not sure about the term "arbitrary molecules", since you can only generate molecules from CHARMM residues and patches.

Problems with Current Workflow

I think the problems with current workflows are very well illustrated. Despite only two packages are discussed (RDKit and MarvinSketch), I think similar problems will arise with other packages.

Future Work

I think it's great that you outline the limitation of pygen-structures compared to psfgen and possible big additions to the functionality of the software. However, I think that the documentation of the current features should be complete and not left as future work.

Dependencies

Acknowledgements

Misc

jaimergp commented 4 years ago

Hi! Yes, we have a functional openmm recipe on conda-forge, but still not publicly available (it's on the testing label). We have to sort out how to deal with potentially unsupported OpenCL implementations (check https://github.com/conda-forge/cfep/issues/17). In the meantime, the omnia channel is needed, but as you can see in the issue this is subject to change in the near-term future.

thesketh commented 4 years ago

@whedon generate pdf

Thank you both for your comments! I'm aware of the attempts to get OpenMM onto conda-forge, but I use OpenMM quite extensively in my research so I'm following this as well.

@amandadumi I've addressed this now in the installation instructions (and fixed the issue you opened about the RDKit installation instructions), but sadly I don't think there's an easy way to get around using conda for the dependencies. It's a shame there isn't a tie-in between pip and conda.

@RMeli I think I've fixed some of the issues you noted with the paper now.

Statement of Need

The "Planned future work" was intended to note that there's another workflow (working with large molecules from the PDB) which currently isn't supported. I've updated the figure to state "Other software" instead. For perfectly formed PDB structures, filling missing hydrogen atoms (and even missing heavy atoms if the surrounding atoms are present) should in principle be easy, but the RDKit currently gives a segmentation fault on larger protein structures (I tried HEWL without success). I'm going to investigate using OpenBabel as a backend as well to address this, but I think it's outside the scope of this paper.

Summary

'arbitrary' has been removed.

Problems with Current Workflows

I'd like to fit the contents of the PDB inside the page margins as well, would it be best to use a line continuation character such as \ and elipses on the following line? This is the main reason why I haven't updated the PSF examples to the current format (which uses the EXT format) - they're much wider.

I've noted similar behaviour from Avogadro (correctly identifies residue, but calls both the L and D forms the L-amino name) and GaussView (behaves similarly to MarvinSketch), but have omitted these examples to avoid making the paper too long.

Future Work

Yes, you're right of course. I've updated this section to reflect this. I still think that further documentation of the polymeric structures included in CHARMM would be useful (e.g. is it possible to make triglycerides from glycerol/fatty acids? Are there glycolipid linkages defined?) but I think this may be better placed in the CHARMM documentation - CHARMM-GUI may already have some information about this.

Dependencies

These have been updated, I've added a citation for PyTest as well.

Acknowledgements

I'm a postgraduate student funded by the University of Strathclyde, but the work on this project has been done in my own time so I don't think that this is applicable here.

Misc

Is there a particular YAML field that this extra information should go under? The example papers only contain university name and department name.

Regarding italicisation, I've italicised all software packages on first invocation, but not afterwards. I couldn't see any guidelines about how best to format this. I can italicise all references to software packages if that would be more consistent

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

It would help if I'd pushed the changes to the flowchart.

@whedon generate pdf

amandadumi commented 4 years ago

@thesketh and @Rmeli, this is clear now. I do not use openmm and was didn't realize the openmm pypi project was simply for a user-friendly interface.

I was also able to go through the paper and had some general comments, many of mine echo what had previously been said, but I had a few specifics that applied to the most recent article proof. I don't think the call to whedon to include the updated flowchart worked though.

Software paper

A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?

State of the field: Do the authors describe how this software compares to other commonly-used packages?

Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?

Small recommendations:

References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)?

arfon commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

arfon commented 4 years ago

@amandadumi - the Whedon calls need to be in their own comment ☝️

dotsdl commented 4 years ago

Hi @thesketh, here are my review notes. Great work on this submission; I believe this to be a good fit for JOSS. I am satisfied that much of the criteria for submission are met, and so the notes below focus heavily on items that I think should be addressed yet.

High-level notes

  1. Nice tool! I find the CHARMM forcefield tooling to be particularly difficult to work with, so this is a welcome addition to the Python tooling for this area.
  2. A lot of the functionality appears to be planned, but is not currently implemented.
  3. The project is less than a month old, with a single contributor. This is not necessarily a problem, but it is an indication to me that this library may benefit from more development before publishing in JOSS.
  4. This tool looks very useful to CHARMM users who need to work with small amino acid sequences. It would be of limited utility outside of either of those needs. This is not necessarily a problem, but a note on the scope of the community it serves.

Review (2020.03.16)

Functionality

  1. Only appears to work for short segments (<15 residue from my ad-hoc tests). This should probably be made more explicit.
    • appears to be addressed in README.
  2. The tool works quite well for short segments, and is much easier to use than other tools that can give similar results.

Documentation

  1. Essentially all documentation is in the README. What is present is a small but good set of examples on usage of the command-line tool.
  2. There isn't really any documentation on how to use the library components, or what the API is.
    • I would like there to be some API components called out for those that would like to use the library as part of larger frameworks, not just as a command-line tool.
    • the paper gives one example of library component usage. It would be better to see some of this in the library docs themselves as well.

Software paper

  1. Only appears to work for short (<15 residue) segments. This should probably be made more explicit in the paper.
    • appears to be addressed in README.
thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

Thank you both for your comments!

@amandadumi I think I've addressed most of these now. I haven't changed the sentence containing "the RDKit" as this is how the package is referred to in its documentation.. I read it as "the Rational Discovery Kit" in my head.

Yes, in the past I've encountered similar behaviour from different 2D and 3D chemical drawing packages. I don't have access to a Windows/Mac computer to check how ChemDraw behaves, and I haven't tried to use PyMol before, but I'll try that and add the citations for these packages and Avogadro.

@dotsdl I submitted the paper at this stage because I believe the introduced functionality - in particular the ability to generate annotated structures for polysaccharides, lipids or protein-sugar linkages - is unique. Other additions would increase usability, but are not limiting progress towards an automated workflow. Much of the planned functionality is "would be nice to have", but is covered by existing software packages (e.g. Modeller to assign positions for missing residues, psfgen to handle protein structures from the protein databank and fill missing atoms, VMD's solvate plugin to handle creation of simulation boxes, and the martinize.py script to coarse grain protein structures for the MARTINI forcefield).

The project has been on GitHub for around a month, but I've been working on it piecemeal for a few months. I am still the only contributor so far, but I'd like to encourage further contribution. You are right that the package is tied in to the CHARMM ecosystem, do you think I should make this more clear?

I'm working on addressing the lack of documentation for library usage, and I've updated the paper to be more explicit about the requirement for structures to be small.

amandadumi commented 4 years ago

@thesketh ha! I have never noticed that in RDKit documentation! Thanks for explaining, that makes sense then.

amandadumi commented 4 years ago

Documentation

@thesketh, your README is very well done and clear. I am also very happy to see the use of typing in your code! I found it very useful. Additionally, I found that your function commenting was thorough!

Two small issues:

Code

RMeli commented 4 years ago

@thesketh Thanks for addressing our comments on the paper (I think the limitation on the number of residues, outlined by @dotsdl, is still missing?). The only "issue" I have left with the paper is the overflow of the code with respect to the text margins. @richardjgowers will there be any editing after the review, @thesketh has to take care of it, or it is OK as it is?

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

Thank you both for your comments.

@amandadumi Well spotted with " instead of '! That will teach me not to use find and replace to fix inconsistent string quotes. As for the module descriptions appearing at the bottom, I'm unsure why they're in the wrong place but I'll try to adjust it. These are autogenerated from the code comments, so should be fixed when I add the documentation for use as a library function. I've added citations for the other drawing packages as you requested.

@RMeli I've added an explicit reference to the maximum number of residues, and I fixed the long lines in the code examples by putting in line continuation characters (and elipses on the following line). I think this should be clear enough, but I can update this if you think there's a better way to show that the lines are wrapped or if you think it should be mentioned somewhere in the text.

I think I've now made all the recommended changes to the paper itself, I'll try to get the library documentation posted early this week.

amandadumi commented 4 years ago

@thesketh All you will need to do is to go into each of the pages content description within the doc/source/ folder. The order of the content for each page is dictated by these files. Moving the "Module Contents" section to the top should do the trick!

thesketh commented 4 years ago

Thanks for the tip, @amandadumi! I ended up re-generating the RST files using sphinx-apidoc, the -Mflag put them in the correct order.

I believe the docs on ReadTheDocs have a fairly comprehensive introduction to library usage on them now.

RMeli commented 4 years ago

@thesketh the documentation is looking great now, well done!

A few comments below:


--name and --segid control the names given in the COMPND record and the segment ID (4 character max) respectively.

When I try to use a --segid longer than 4 characters, the code executes normally without warning. I think this is ok (since user should know this limitation), but maybe it is worth adding a warning that the name will be truncated just in case?


Under the Library Usage - Basic Usage section there is an error in the 4th code block:

patches = {'RAFF', [0, 1, 2]}  # Should be a dictionary: {'RAFF': [0, 1, 2]}

I get True here:

>>> mol = ps.code_to_mol('AdP', rtf)
>>> # This will pass verification in 0.2.3
>>> mol.check_parameters(prm)
False

I tried to see which version of pygen_structures I installed by printing ps.__version__ (I installed 0.2.3 of course, as hinted by the comment) but AttributeError: module 'pygen_structures' has no attribute '__version__'; I think it's worth adding this attribute to the package.

thesketh commented 4 years ago

Thanks for your comments, @RMeli. I've added a warning for segids greater than four characters on the Molecule class now.

Well spotted on the wrong type in the doccs - I fixed it in the example before but missed it in that one!

The version string was in pygen_structures.version.version before, but I've added it to pygen_structures.__version__ as well.

thesketh commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

thesketh commented 4 years ago

@richardjgowers I've released another version (v0.2.4) on PyPI to reflect @RMeli's recommended changes. I've updated the examples in the article to reflect the new version, could you update the version with whedon? I believe it's an editor-only command

arfon commented 4 years ago

@whedon set v0.2.4 as version

whedon commented 4 years ago

OK. v0.2.4 is the version.