JOSS Review - Githubissues

Hi all,

Thanks for the invitation to review and congrats on the submission.

The general idea behind this submission is sound, and follows-up on a 2021 publication from the same authors on E2EDNA v1.0, published in JCIM. From my understanding, the code is essentially a re-write to use OpenMM instead of Tinker as the MD engine. While this is valuable - makes it simpler to install/run - the authors do not realize, in my opinion, this change to its fullest potential. The authors repository is not so much a "package" in the traditional sense, but more of a collection of scripts that automate a certain rigid protocol. I would rather see for instance, NUPACK being an optional dependency - as a user, I could simply provide my own DNA molecules instead of being forced to use NUPACK. In this sense, I think this repository could use more work to stand out on its own compared to last year's publication.

In addition to this comments, I have a general comment on the repository itself. The authors should take some time to clean up files that are no longer useful for the protocol or that are simply part of the development workflow. Folders named old, or IDE config folders (.idea) should not be part of a published version of the repository, specially when they are even marked to be ignored in the .gitignore file. Same with the existence of both a requirements.txt file and an environment.yml file, whereas only the latter is used. As such, I believe that the authors should spend some time cleaning up the repository and setting up a more "traditional" structure to help potential users navigate through their code base more easily.

Further, I have a few starter questions about the manuscript, code and, licenses that I think should be clarified. Hopefully these will help the authors improve their work and repository/code.

Licenses

You're licensing the tool under the Apache license but you are including data (parameter sets) that falls under a difference license. In particular, I see the parameter files for the Amoeba forcefield taken from Tinker/OpenMM almost verbatim. Did you check with the appropriate developers if this sharing of the forcefield parameter files is allowed under their license, without any attribution?

Installation

The installation process is quite complex. As a user, I'd have to register and download NUPACK and MMB, as well as edit a series of files in order to get a functional installation. This is simply a suggestion for the developers to keep in mind.
Related to the point above, have the authors considered using conda directly to install their software, instead of a custom shell script? pdbfixer is available as a conda package, and you could specify pip packages there too, e.g. lightdock. The installation could be reduced to a simple: 1) install nupack 2) install mmb 3) run conda env create -f e2edna-env.yml.
On this last point, the authors should strip the granular version of the env yaml file otherwise conda will struggle with versions on anything but the authors' hardware.
According to the README, the code is only tested on MacOS, although I'd imagine the most use would be on a compute cluster running Linux. Have the authors tried running their code on Linux?

Misc

In several sections of their documentation, the authors mention "OpenDNA". Was this the previous name of this package?
It would be greatly beneficial for a user to have config files with installation paths, simulation settings etc, instead of having to edit source code. Would the authors be open to this change?

Comments on the Manuscript

In the "Statement of Need", the authors mention an "all-python" package several times. Being pedantic, this is not entirely true as their code relies on quite some compiled code in their dependencies (lightdock, openmm).

Thanks very much for the comments.

(1) About comparison between E2EDNA (JCIM) and E2EDNA 2.0 (this manuscript):

The general idea behind this submission is sound, and follows-up on a 2021 publication from the same authors on E2EDNA v1.0, published in JCIM. From my understanding, the code is essentially a re-write to use OpenMM instead of Tinker as the MD engine. While this is valuable - makes it simpler to install/run - the authors do not realize, in my opinion, this change to its fullest potential. The authors repository is not so much a "package" in the traditional sense, but more of a collection of scripts that automate a certain rigid protocol. I would rather see for instance, NUPACK being an optional dependency - as a user, I could simply provide my own DNA molecules instead of being forced to use NUPACK. In this sense, I think this repository could use more work to stand out on its own compared to last year's publication.

Our response:

Compared to the original E2EDNA, we added more flexibility to E2EDNA 2.0 surrounding the 8 simulation modes. Specifically, we designed several "toggles" on top of these modes to allow users to customize their simulation protocol depending on their demand. One of the toggles, params['skip MMB'], was designed exactly for the purpose the reviewer mentioned, which is to skip NUPACK and MMB if a user can provide a folded structure of DNA aptamer. Another example is params['pick up from freeAptamerChk'] which allows a user to continue an unfinished MD sampling of free aptamer from a checkpoint file. This prevents a user from having to start the pipeline from scratch if a lengthy MD sampling is stopped for some reason.

We also would like to point out that we did not intend to provide every possible way of customizing the simulation pipeline, because there could be plenty of situations different users might encounter. We limited ourselves to developing a finite set of features for the common situations that we reasoned our potential users might run into and drew a line there for this submission. Meanwhile, we encourage anyone who is interested in making contributions to dig in our code, add features or improvement, as stated in the CONTRIBUTING.md (link). So far there has been some interest in doing so.

(2) About cleaning up the code base:

In addition to this comments, I have a general comment on the repository itself. The authors should take some time to clean up files that are no longer useful for the protocol or that are simply part of the development workflow. Folders named old, or IDE config folders (.idea) should not be part of a published version of the repository, specially when they are even marked to be ignored in the .gitignore file. Same with the existence of both a requirements.txt file and an environment.yml file, whereas only the latter is used. As such, I believe that the authors should spend some time cleaning up the repository and setting up a more "traditional" structure to help potential users navigate through their code base more easily.

Our response: Thanks for the suggestion to clean up the repository. We have done so in a branch called “JOSS” for this submission which contains the cleaned version of the code base and our manuscript. But unfortunately because of a misunderstanding the reviewer was directed to the wrong branch. Please take a look at the JOSS branch for a version that is up to the standards that the reviewer has requested.

Changes to the code base: Unnecessary folders and files have been further deleted from the JOSS branch, which are .idea, .gitignore and .gitattributes.

(3) Licenses:

You're licensing the tool under the Apache license but you are including data (parameter sets) that falls under a difference license. In particular, I see the parameter files for the Amoeba forcefield taken from Tinker/OpenMM almost verbatim. Did you check with the appropriate developers if this sharing of the forcefield parameter files is allowed under their license, without any attribution?

Our response:

The parameter sets mentioned by the reviewer are in the main branch, which is not meant to be used for this submission. In the JOSS branch (for this submission) which contains the cleaned version of the code base, all those parameter files were deleted therefore will not be shared if our submitted work gets published.

We chose Apache-2 license which does not include the condition of "same license" that GPLv3 does (shown in the help website of choosing a license provided by GitHub). For this reason, I think that the Apache-2 license could give more flexibility to our software than GPLv3 as our E2EDNA (current and future versions) might be used by a larger software. But I am not a license expert and this is my own opinion. Meanwhile, we are in contact with the developer of LightDock (licensed under GPLv3) to work out if licensing E2EDNA 2.0 with Apache-2 would cause any conflict (here). The other main external software packages used in the E2EDNA 2.0 (NUPACK, MMB, and OpenMM) have relatively permissive licenses like MIT license and support non-commercial academic use.

(4) Installation

The installation process is quite complex. As a user, I'd have to register and download NUPACK and MMB, as well as edit a series of files in order to get a functional installation. This is simply a suggestion for the developers to keep in mind.

Our response: I agree with the reviewer on the hurdles of installing NUPACK and MMB. We thought about encapsulating everything into a python package that a user can simply import but did not proceed that way because we would like our simulation pipeline to be open for user customization including allowing a user to change the software package used in each module.

Related to the point above, have the authors considered using conda directly to install their software, instead of a custom shell script? pdbfixer is available as a conda package, and you could specify pip packages there too, e.g. lightdock. The installation could be reduced to a simple: 1) install nupack 2) install mmb 3) run conda env create -f e2edna-env.yml.

Our response: Yes, we used conda to install the software in the JOSS branch, as shown in the "macos_installation.sh". We basically carried out the same three steps as the reviewer suggested in a slightly different order. The bash script can be found here.

Change to the code: pdbfixer and lightdock packages have been added to the e2edna-env.yml file.

On this last point, the authors should strip the granular version of the env yaml file otherwise conda will struggle with versions on anything but the authors' hardware.

Our response: Thanks for the suggestion. We have updated the e2edna-env.yml by removing all the "build specification" strings. The “e2edna-env.yml” now only contains names and versions of packages and their dependencies.

According to the README, the code is only tested on MacOS, although I'd imagine the most use would be on a compute cluster running Linux. Have the authors tried running their code on Linux?

Our response: Yes, we have had this code installed and running on a Linux cluster. We provided installation instructions and automated tests on macOS only for review purposes but there are no inherent limitations on the OS.

(5) Misc

In several sections of their documentation, the authors mention "OpenDNA". Was this the previous name of this package?

Our response: Yes, “OpenDNA” was an older name before we settled down on “E2EDNA 2.0”. Also, the files which mentioned “OpenDNA” were not necessary for this submission therefore were deleted in the JOSS branch.

It would be greatly beneficial for a user to have config files with installation paths, simulation settings etc, instead of having to edit source code. Would the authors be open to this change?

Our response: Thanks for the suggestion. Yes, we are open to set up such config files (such as in .yml format).

(6) Comments on the Manuscript

In the "Statement of Need", the authors mention an "all-python" package several times. Being pedantic, this is not entirely true as their code relies on quite some compiled code in their dependencies (lightdock, openmm).

Our response: Thanks for pointing it out. We agree that some modules are not fully written in python but the code is easy to install and interface with other python modules. For this reason we have rephrased “all-python” to "python-interfacing".

Changes to the manuscript: Updated two places where “all-python” had been used: “all-python simulation package” and “all-python implementation”

Hi @taoliu032,

Thanks for your answers.

Would it be possible to make available an example that does not require the dependencies users have to install manually? I'd like to test the software before giving it a 👍 for submission but I don't want to register for and install several dependencies on my laptop just for this purpose.

Hi @JoaoRodrigues, Thanks for your comment.

To clarify: I think that the two dependencies you referred to are NUPACK and MMB. If so, yes it is possible to bypass both of them by providing the software with a DNA structure in pdb format.

As you requested, I have made an example that does not require the dependencies of NUPACK and MMB. To do so, I needed to slightly modify the code, such as removing the import statement from nupack import *. So, to avoid having conflicts with the other reviewer's review process, I created another branch for the modified code, called JOSS_test_without_requiring_nupack_and_MMB.

In this branch, "foldedAptamer_0.pdb" is the example DNA aptamer structure for the test. Regarding how to run the test on your local computer: for convenience, after setting up the conda environment and workdir path in main.py, you can simply run the last test in the automate_tests.sh. It will test all the modules (excluding the bypassed two) of the software. For reference, the corresponding command is also copied here: python main.py --run_num=8 --mode='full binding' --aptamerSeq='TAATGTTAATTG' --ligand='YQTQ.pdb' --ligandType='peptide' --ligandSeq='YQTQTNSPRRAR'

Hi @taoliu032, thanks for the quick update, I'll give it a shot.

In the meantime, I'd suggest you make NUPACK and MMB runtime dependencies, maybe supported by a command-line flag? That way users won't have to modify the source code to run part of the workflow. This goes back to my previous comment of allowing users to specify parameters for the run in a config file.

This type of dependencies is usually implemented by isolating the part of the code requiring the third-party code in question in a function or class, and then having the import there in a try/except statement, .e.g:

def run_nupack(....):
    try:
        import nupack
    except ImportError as err:
        raise ImportError("Required module 'nupack' not found. Please install from https://....") from err

    # run nupack-dependent code
    ...

In addition, here are some comments from installing and running the software:

Running the macos_installation.sh script fails on my machine because I did not install nupack. The script has a hardcoded path for a folder. Please refactor this. I realize this is "fixed" in the branch you pointed me to, but there should be a general solution for this problem.
You should consider having a folder called examples where you put the input data files needed to run particular workflows and maybe a shell script with the options. This will help users get started using your software.
The command you wrote me before failed with an error: FileNotFoundError: [Errno 2] No such file or directory: '/absolute-path-to-E2EDNA2-JOSS/localruns/run8'. Please, do not hardcode values or paths. Use configuration files. Do not expect users to edit your source code.
Related to the above. I had to edit the main.py file in a couple of lines and then create two separate folders (scratch/ and localruns/) in order for the code to run. If I was a prospective user, I'd be very confused about what to do and would most likely give up. Please, consider supper user-configuration files, or at the very least separating configurations from code in a "global" yaml file for these top-level configs. Like this, will users have to copy the repository every time they want to run a new system? Or try a different set of parameters?
The workflow should be pretty clear about what files are temporary/intermediate and what files the user should look at at the very end. There is a record.txt file that I assume is a log file, and should be renamed as such for clarity. Other than that, I cannot understand, from the log file and the output, which files I should look once the simulations finish. You could either write that explicitly in the log file, separate the different stages in sub-folders in the run folder, or simply rename the files "stage_0", "stage_1", ..., "final_output" or something like that.

This type of dependencies is usually implemented by isolating the part of the code requiring the third-party code in question in a function or class, and then having the import there in a try/except statement

The command you wrote me before failed with an error: FileNotFoundError: [Errno 2] No such file or directory: '/absolute-path-to-E2EDNA2-JOSS/localruns/run8'. Please, do not hardcode values or paths. Use configuration files. Do not expect users to edit your source code.

I have the same error, and agree that you should not use hardcoded values.

I see that most of the logic revolves around this params dictionary, a very simple solution would be to write a configuration.json with all the "hardcoded" values, load this during execution and add the cmdLineInputs as needed.

Thank you both very much for the comments and advice. We have started some of the upgrades as you mentioned since the submission, and I will get back to you as soon as possible.

@JoaoRodrigues @rvhonorato Thank you both very much for the very helpful suggestions! I updated the codebase (here) accordingly and tested them myself already. Now we are waiting for any feedback from our beta tester. Once it is done (should not be long), I will update the JOSS branch and invite you to test the code.

================== Summary of the updates:

I implemented both .yaml configuration file and argparse to improve the way of handling the input arguments. Now, a user can edit input arguments via .yaml file or command line without touching source code.
- I chose to provide both ways to increase the adaptability of the code. Sometimes, a user may want to iterate a list of possible values for an argument, then it might be easier to do so in command line rather than in a configuration file.
- Currently, if an argument is specified in both .yaml and command line, the one in .yaml will overwrite that in command line. Such superiority of .yaml can be removed by commenting out that argument in .yaml.
I reorganized the output directory (eg, run2) so that only the main outputs remain in the main output folder and all intermediate or temporary files are grouped to subfolders based on which module they were generated from. For example, run2/mmb_temp_files_0/ stores temporary files generated by MMB and the _0 indicates the index of a predicted secondary structure of DNA.
In examples/ folder, example .yaml files and .sh scripts for 2 sets of automated tests are provided, as well as a ligand structure (example_peptide_ligand.pdb) and pre-folded DNA structure (foldedAptamer_0.pdb).
- To run a set of the automated tests: simply copy a pair of .sh and .yaml to the main codebase directory (contains main.py) then run .sh there.
- One set of automated tests assumes the third-party dependencies NUPACK and MMB are not available: automated_tests_no_NUPACK_MMB.sh and simu_config_automated_tests_no_NUPACK_MMB.yaml

Thank you very much again.

Hi @taoliu032,

Thanks for the revisions, looking good! I have one final comment, but I leave it up to you to implement it. I'd swap the priority of the parameters, so that command-line arguments override yaml configurations. The reason is, like you said, it's easier for people to iterate several parameters via command-line. So, they can define a yaml with a bunch of settings and then iterate, for example, on a parameter via the command line.

I'll give you the go-ahead in the JOSS thread.

@JoaoRodrigues Thank you for all the patient and helpful feedback!

I thought about making command-line arguments override yaml configuration. In that case, even if an argument is not specified via command-line, its default value from argparse (eg, Python None) in main.py will override yaml. This seems to effectively require every single argument to be specified via command-line if we don't want its default value. But there are so many arguments that it might become too clumsy if all of them show up in the command-line for a single run. But I am a novice in using argparse and configuration file, so perhaps there is some trick I do not know yet to resolve this?

In the current design, a user can achieve what you suggested: for a series of runs, they can define a single yaml that only contains the common parameters, then specify the to-be-iterated parameter(s) via command-line. In this way, since the parameters specified by yaml and command-line are mutually exclusive, there won't be overriding.

Good job on the changes! Looking great ✅

Thank you both again!

siminegroup / E2EDNA2

JOSS Review #14

Licenses

Installation

Misc

Comments on the Manuscript