cuh2vizR: Visualizing and reasoning about ND potential energy surfaces

Submitting Author Name: Rohit Goswami Submitting Author Github Handle: !--author1-->@HaoZeke<!--end-author1-- Repository: https://github.com/HaoZeke/cuh2vizR Submission type: Pre-submission Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: cuh2vizR
Title: Plot Copper-Hydrogen Systems
Version: 0.0.0.9000
Authors@R:
    person("Rohit", "Goswami", , "rgoswami@ieee.org", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0002-2393-8056"))
Description: A package meant to plot a CuH2 system
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Suggests:
    testthat (>= 3.0.0)
Config/testthat/edition: 3
SystemRequirements: C++17
LinkingTo:
    cpp11
Imports:
    assertthat,
    cli,
    gganimate,
    ggplot2,
    glue,
    magrittr,
    optparse,
    progress,
    readConR

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

Data Lifecycle Packages
- [ ] data retrieval
- [ ] data extraction
- [X] data munging
- [ ] data deposition
- [X] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [X] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
  
  Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [ ] Machine Learning
- [ ] Regression and Supervised Learning
- [ ] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

So this is a wrapper around a potential energy function written in Fortran, wrapped in C++ and then passed through to R. It can be used to read con files (though there's a separate package I'm working on for that). The main user benefit is the ability to quickly plot the energy surface of a non-trivial chemical system, that of Hydrogen molecule diffusion on a copper surface. This is an ND system, but with only two degrees of freedom, making it good to visualize.

If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package?
Who is the target audience and what are scientific applications of this package?

Downstream users (my group, collaborators) use this to train / view and reason about potential energy surface surrogates. It make it easy to visualize a normally complex system and compute metrics. Since the eyeball norm is popular, and often we train over multiple system points, this also provides a CLI with animation capability for on-the-fly visuals.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

To the best of my knowledege there are no R packages tailored to materials science / physical chemistry use. There are python packages (e.g. ase, the atomic simulation environment), but they are far too slow. Indeed, speed is one of the main considerations of this package.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable.

Any other questions or issues we should be aware of?:

I'd love to get reviews and there's definitely a lot of science going on (potential energy functions are hard!) but the libraries make the R wrapper unlikely to make it to CRAN (there are quite a few dependencies and its best run with micromamba or pixi). I've tested it (and my collaborators) on several systems (MacOS and *nix systems including HPCs), but that was why I'm not sure it'd be a good fit.

I know there's some more work to be done (vignettes), but I was hoping to get a go/no-go for submission via the presubmission system :)

Hello @HaoZeke, thank you for your submission! This is an interesting case. We are certainly interested in supporting tools of materials science and physical chemistry tools.

First, while we don't require that things will go to CRAN, we have similar requirements that the package should pass R CMD CHECK tests on CI across platforms. That this would require setting organizing the package to usuable "standalone", without the pixi or micromamba, following R package conventions for including compiled code. As far as I can tell from browsing around your repositories, this is is doable with some reorganization but involve different build tooling that you have been using.

Secondly, I note that there are not a lot of functions in this package, given that the bulk of the development action is in readCon and potlib, and they are mostly for visualization. Straight visualization tools are out of scope for us unless they are part of the statistical EDA. The readCon package is a more straightforward fit, in that it is for "parsing scientific data types and outputs from scientific equipment", under our data extraction category. Looking at both, I think it would make sense to fold these R packages together.

Finally, I see that you've put a good bit of the development effort into the CLI, rather than R interface, for these functions. There isn't a lot of convention in this area, so I'd anticipate a good bit of discussion from reviewers about the best way to organize this. I'd encourage you to look at how it is done in these examples: https://github.com/eddelbuettel/littler/tree/master/inst/examples

So in summary - cuh2vizR is an edge case, but readConR or a combination of the two fits into our scope, and we'd be happy to review it if it can meet our testing and CI requirements.

Thank you for the detailed comments (and prompt response!). I will address each in order

Hello @HaoZeke, thank you for your submission! This is an interesting case. We are certainly interested in supporting tools of materials science and physical chemistry tools.

First, while we don't require that things will go to CRAN, we have similar requirements that the package should pass R CMD CHECK tests on CI across platforms. That this would require setting organizing the package to usuable "standalone", without the pixi or micromamba, following R package conventions for including compiled code. As far as I can tell from browsing around your repositories, this is is doable with some reorganization but involve different build tooling that you have been using.

I think that makes sense, especially readCon, will need to think a bit about potlib. pixi and micromamba are essentially helpers, without which the user must have the required libraries system wide.

Secondly, I note that there are not a lot of functions in this package, given that the bulk of the development action is in readCon and potlib, and they are mostly for visualization. Straight visualization tools are out of scope for us unless they are part of the statistical EDA. The readCon package is a more straightforward fit, in that it is for "parsing scientific data types and outputs from scientific equipment", under our data extraction category. Looking at both, I think it would make sense to fold these R packages together.

Unfortunately I don't think it would be a good idea to fold cuh2vizR into either readCon or potlib, since the bulk of the functionality is geared towards a very specific chemical system (a very useful test case, but nonetheless not general enough to be in either). I'm glad to hear the readCon package would make sense!

Finally, I see that you've put a good bit of the development effort into the CLI, rather than R interface, for these functions. There isn't a lot of convention in this area, so I'd anticipate a good bit of discussion from reviewers about the best way to organize this. I'd encourage you to look at how it is done in these examples: https://github.com/eddelbuettel/littler/tree/master/inst/examples

So in summary - cuh2vizR is an edge case, but readConR or a combination of the two fits into our scope, and we'd be happy to review it if it can meet our testing and CI requirements.

I think then, I shall clean up readconR (here) for submission first, and then try to see if I can get potlibR to compile in isolation for a possible followup.

However, one tiny note, the main functionality of cuh2vizR is one of data-wrangling, given a readCon output data-frame, this generates one point of an energy surface, to get the other points, cuh2vizR has helpers to convert the user friendly (and intuitive) concepts of H-H distance and Cu-H distance to the number of atoms * 3 matrix format and has helpers to generate a grid over H-H and Cu-H values to study the entire energy surface (formed by evaluating for each point the system configuration).

This is also why it is specialized to a single system. Every potential in potlib will provide an energy for a configuration read with readcon, but only for the CuH2 case does it make sense to plot in 2 dimensions because the effective degrees of freedom in the system are two, and so only for such a system does one need the remaining helpers (which are not exposed to the user, in R except through the generation of a data-frame), but could be. Essentially, it translates between the user and potlib + readCon to enable one to ask natural questions e.g. what about the energy when we vary H-H from blah to blah, instead of:

Reading the con file (simple with readCon)
Finding the H atoms
Modifying their distance from a reference value (optionally also H-Cu, after finding the topmost / nearest Cu atom)
Storing the energy at the requested points (potentially after writing each configuration out for potlib)

The package hooks into potlib and readCon to bypass much of that functionality in C++, letting the user focus on their scientific questions without any lags.

Similarly, the CLI is used to facilitate the production of path animations (where a system is evolving):

Does this sound like it might be suitable? If not I'll close this and reopen with readcon :)

Thanks for the explanation. In general we aim to balance generality for re-use with field-specific tooling. I admit I have a bit of trouble with physical chemistry being far my field, but based on your explanation, readCon provides a scientific format parser (👍 in scope), potlib provides a set of standard and expandable library potential calculations/values (👍 in-scope), and cuh2vizR provides a visualization for a narrow case where the dimension is 2. The thing I don't fully understand is how much cuh2vizR is a general tool vs a narrow workflow.

One reason data visualization tends to be out-of-scope is it tends to be difficult to standardize - people have various and strong opinions as to what plots should look like, and it is often quite specific to the application and venue. So we often encourage authors to produce processed data in a standard format ready for visualization by the user, and/or plot objects designed to be modified It seems to me that a standard framework that works for higher dimensions, with visualization options for 2-D cases or 2-D slices, would be best, but that's really for input from reviewers we find with some experience in your field.

But yes, I'd suggest preparing readConR for submission, and deferring cuh2vizR, which might make more sense folded in to potlibR.

Thank you, closing in favor of other issues (coming soon ^_^)

EDIT: Just to add to some of the comments:

The thing I don't fully understand is how much cuh2vizR is a general tool vs a narrow workflow.

Not general at all, even other 2D systems may have different degrees of freedom which would need special handling, e.g. maybe its the bond angle between something and a distance which are the 2 degrees of freedom for the energy surface of another molecule (that's why we have some "reference" systems like this which can be visualized without disagreement :) )

So we often encourage authors to produce processed data in a standard format ready for visualization by the user, and/or plot objects designed to be modified It seems to me that a standard framework that works for higher dimensions, with visualization options for 2-D cases or 2-D slices, would be best, but that's really for input from reviewers we find with some experience in your field.

This is an interesting idea, I haven't thought about it but I suppose a general set of manipulations of readCon files would be in scope? e.g. the functions used to say target two atoms and stretch the distance between them? Currently they're the bulk of the cuh2vizR only specialized to H, Cu, but they neeedn't be... Would that be in scope as well perhaps @noamross ? It would then a set of functions operating on dataframes from readCon and returning other modified dataframes (so it should be standard / general).

Yes, that would be in scope, under the "data munging" category if the manipulations are sufficiently standard calculations. If the manipulations are novel algorithms, it's the kind of thing that would fall under our statistical standards though admittedly those may not be a great match for some things in the physical sciences.

ropensci / software-review

cuh2vizR: Visualizing and reasoning about ND potential energy surfaces #604

Scope