ropensci / software-review

rOpenSci Software Peer Review.

## Reply to package review by @njtierney #634

Closed hanecakr closed 3 months ago

hanecakr commented 3 months ago

Again a big thank you for your review report and time. I was not able to address all issues raised earlier, but have now found some time to work on the package and to provide an answer to your comments and suggestions. I've copy-pasted the review report and inserted my replies below.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

I have added you as a reviewer in the DESCRIPTION.

Documentation

The package includes all the following forms of documentation:

The opening paragraphs of the README are good, and I think that this R package solves a challenging problem, so firstly, well done! I think it could be made a little bit clearer in terms of the problem it solves, and the input it takes. While I find the photos useful, they initially made me think that this software takes images as input. I would suggest something more like what is in the vignette to start:

fellingdateR offers a set of functions that assist in inferring felling date estimates from dated tree-ring series.

Then, describe the problem you want to solve, which I think is estimating when the timber was cut down. Then show the data, explain what the columns mean, and how this might be a typical example of dated tree-ring series data.

Then show a short example of the output, clearly demonstrating the problem the package solves.

The rest of the first paragraph:

The presence of (partially) preserved sapwood or waney edge allows to estimate a range for the actual felling date, for individual series as well as for a group of timbers. Furthermore, an additional function provides a tool to sum sapwood probability distributions, comparable to 'summed probability densities' commonly applied to sets of radiocarbon (14C) dates.

Is important, but I think could go into more of a methods/general introduction part of the README, perhaps further down.

I'm not sure what the images show me, and so to communicate this effectively I think they should contain a caption.

I think the target audience could be more clearly stated in the README. Perhaps at the end of the first paragraph.

README has been rewritten according to comments of both reviews.

The 'Get started' vignette provides more detail and examples.

All installed well for me!

It did run successfully locally! T and F should be specified as TRUE and FALSE.

Now TRUE and FALSE are used consistently

The examples ran without error, using:

devtools::run_examples()

There are no community guidelines in the README, I see them in the file: .github/CONTRIBUTING.md, but these are not linked to in the README. Once these are linked, e.g., by writing something like:

## Code of Conduct

Please note that the fellingdateR project is released with a [Contributor Code of Conduct](https://github.com/hanecakr/fellingdateR/blob/main/.github/CONTRIBUTING.md). By contributing to this project, you agree to abide by its terms.

Community guidelines and code of conduct have been added

Functionality

All tests pass - unit tests seem quite good coverage, evaluated using devtools::test_coverage().

Estimated hours spent reviewing: 5


Review Comments

I wanted to open by saying that while I have a lot of feedback, I think that this is a great piece of software that helps solve a tough problem, so well done on the author for writing this! I hope that the feedback is useful 😄 . Please let me know if something is not clear or if you need help implementing these, or further information. Thank you for submitting this software, I enjoyed reviewing it.

General comments

There are a fair few examples from the rOpenSci packaging guide, which I don't think are followed, I have gone through the guide and written some examples here. After the author makes these changes, I would recommend they double check the guide.

  ✖ write short and simple
    functions. These functions
    have high cyclomatic
    complexity (>50): read_fh
    (150). You can make them
    easier to reason about by
    encapsulating distinct steps
    of your function into
    subfunctions.
  ✖ use '<-' for
    assignment instead of '='.
    '<-' is the standard, and R
    users and developers are used
    it and it is easier to read
    your code for them if you use
    '<-'.
  ✖ avoid long code lines,
    it is bad for readability.
    Also, many people prefer
    editor windows that are about
    80 characters wide. Try make
    your lines shorter than 80
    characters
  ✖ avoid sapply(), it is
    not type safe. It might return
    a vector, or a list, depending
    on the input data. Consider
    using vapply() instead.
  ✖ avoid 1:length(...),
    1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...)
    expressions. They are error
    prone and result 1:0 if the
    expression on the right hand
    side is zero. Use seq_len() or
    seq_along() instead.
  ✖ avoid 'T' and 'F', as
    they are just variables which
    are set to the logicals 'TRUE'
    and 'FALSE' by default, but
    are not reserved words and
    hence can be overwritten by
    the user.  Hence, one should
    always use 'TRUE' and 'FALSE'
    for the logicals.
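Two of the points in the output above are easy to reproduce in a console. A small base R illustration (the objects are made up for the example and do not come from the package):

```r
# An empty input is where 1:length() goes wrong
x <- numeric(0)
bad_index <- 1:length(x)    # counts down from 1 to 0, so a loop runs twice
good_index <- seq_along(x)  # zero-length, so a loop body never runs

# sapply() changes its return type with the input; vapply() does not
empty <- list()
loose <- sapply(empty, length)               # an empty list
strict <- vapply(empty, length, integer(1))  # an empty integer vector
```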

All code has been styled with the styler package; <-, TRUE and FALSE are now used consistently, and the length of some functions has been reduced by implementing helper functions, e.g. for checking input. Use of sapply() and 1:length() has been avoided.

if (is.null(y)) {
     y = x
     noRef = TRUE
}
else {
     noRef = FALSE
     y_ori <- y
}

any(pdf_matrix[, 2:length(keycodes) + 1] == 1, na.rm = TRUE))

Input checking

I would recommend writing small helpers for input checking, and considering using cli to help write error messages, as it means you could transform this:

if (!inherits(x, "rwl"))  {
     warning("'x' is not class rwl")
}
if (!inherits(y, "rwl"))  {
     warning("'y' is not class rwl")
}

Into:

warn_if_not_rwl(x)
warn_if_not_rwl(y)

And that code could look like this:

warn_if_not_rwl <- function(x,
                            arg = rlang::caller_arg(x),
                            call = rlang::caller_env()){
     # only warn when the class is actually missing
     if (!inherits(x, "rwl")) {
          cli::cli_warn(
               c("{.arg {arg}} is not of class {.cls rwl}")
          )
     }
}

Similarly,

increasing_consecutive_years <- all(diff(as.numeric(row.names(x))) == 1)
if (!increasing_consecutive_years) {
     stop(
          "The tree-ring series 'x' have/has no consecutive years in increasing order as rownames."
     )
}

Could be written as a function:

check_if_increasing_consecutive_years(x)
check_if_increasing_consecutive_years(y)
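A minimal sketch of what such a helper might look like, written in base R so it runs without extra packages (cli::cli_abort() could be swapped in for stop()). The function body and the `arg` parameter are assumptions for illustration, not the package's actual code:

```r
# Hypothetical helper: errors unless the row names are consecutive,
# increasing years (1900, 1901, 1902, ...).
check_if_increasing_consecutive_years <- function(x,
                                                  arg = deparse(substitute(x))) {
  years <- as.numeric(row.names(x))
  # isTRUE() also catches the NA that non-numeric row names produce
  increasing_consecutive_years <- isTRUE(all(diff(years) == 1))
  if (!increasing_consecutive_years) {
    stop(
      "The tree-ring series '", arg,
      "' has no consecutive years in increasing order as rownames.",
      call. = FALSE
    )
  }
  invisible(x)
}

good <- data.frame(ring = c(1.2, 0.9, 1.1),
                   row.names = c("1900", "1901", "1902"))
bad <- data.frame(ring = c(1.2, 0.9, 1.1),
                  row.names = c("1900", "1902", "1901"))

check_if_increasing_consecutive_years(good)  # passes silently
bad_message <- tryCatch(
  check_if_increasing_consecutive_years(bad),
  error = function(e) conditionMessage(e)
)
```

Using the name of the argument in the message means the same helper reports 'x' or 'y' correctly without duplicating the check.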

Admittedly, I do have a strong preference for writing these types of functions, having written about it recently, but I do think that at least using explaining variables, which you've already done in places like:

increasing_consecutive_years <- all(diff(as.numeric(row.names(x))) == 1)

Are a great idea, and there are a few notable places where that would help make the code a bit easier to read, e.g.,

any(
length(min_overlap) != 1 |
!is.numeric(min_overlap) |
min_overlap %% 1 != 0 |
min_overlap < 3
)
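As an illustration of the explaining-variable idea, the condition above could be split into named steps. This is a sketch only: the helper name `check_min_overlap()` and the wording of the message are assumptions. It also uses && so that a vector of length two is rejected before the numeric comparisons are attempted:

```r
# Hypothetical helper: min_overlap must be a single whole number >= 3.
check_min_overlap <- function(min_overlap) {
  is_single_number <- length(min_overlap) == 1 && is.numeric(min_overlap)
  is_whole_number <- is_single_number && min_overlap %% 1 == 0
  is_large_enough <- is_whole_number && min_overlap >= 3
  if (!is_large_enough) {
    stop("'min_overlap' should be a single whole number >= 3", call. = FALSE)
  }
  invisible(min_overlap)
}

check_min_overlap(30)  # passes silently
overlap_message <- tryCatch(
  check_min_overlap(2.5),
  error = function(e) conditionMessage(e)
)
```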

cor_table.R

Refactoring values argument of cor_table. There is a lot of input checking for the values argument. I think that things such as :

if ("glk" %in%  values) {

And so on indicate to me that these could be written up as separate functions, which could return a list of their inputs, perhaps. These could then be delivered using switch, which I often forget how to use, but it would be something like:

values_output <- switch(values,
       "glk" = values_glk(inputs),
       "pearson" = values_pearson(inputs))

Examples should demonstrate all types of the inputs for the function arguments.

The parameter `values` was removed from the function. Looking back, this is not an option that would be used frequently, and it is certainly not required. Removing it shortens the code a bit and makes many of the input checks unnecessary.

data.R

I would recommend standardising the dataset names to be all lowercase, so that they are easier to remember. E.g., Sohar_2012_FWE_c becomes: sohar_2012_fwe_c

The datasets include names of authors. The names of the datasets can be easily copied from sw_data_overview()

fd_report.R

I think that fd_report could be renamed felling_report or felling_date_report or similar. While fd is concise, I think it doesn't help facilitate discoverability of the functions.

Similar to cor_table.R, I think that:

if (!series %in% names(df)) {
      stop("--> 'series' does not exist")
}
if (!last %in% names(df)) {
      stop("--> 'last' does not exist")
}
if (!n_sapwood %in% names(df)) {
      stop("--> 'n_sapwood' does not exist")
}
if (!waneyedge %in% names(df)) {
      stop("--> 'waneyedge' does not exist")
}

Could be rewritten as check_if_variable_exists(). Something like:

check_if_variable_exists <- function(x,
                                     df,
                                     arg = rlang::caller_arg(x),
                                     call = rlang::caller_env()){
     arg_in_data <- x %in% names(df)
     if (!arg_in_data) {
          cli::cli_abort(
               c("{.arg {arg}} does not exist")
          )
     }
}

example_checker <- function(x, 
                            series = "series", 
                            last = "last"){
     check_if_variable_exists(series, x)
     check_if_variable_exists(last, x)
}

example_checker(mtcars, 
                series = "wrong")

## Error in `check_if_variable_exists()`:
## ! `series` does not exist

The check_input function is now part of helper-functions.R

get_header.R

This function should move the cat message up the top - and should not use cat, instead using one of the cli functions, like cli_abort.

I think you could use structure instead of setting attributes to NULL:

attr(rwl, "row.names") <- NULL
attr(rwl, "po") <- NULL
attr(rwl, "class") <- NULL
attr(rwl, "names") <- NULL

## becomes

rwl <- structure(
          rwl,
          row.names = NULL,
          po = NULL,
          class = NULL,
          names = NULL
     )

Although I think that they are functionally the same, so feel free to ignore!

cat() no longer used

hdi

This function uses = and <- - suggest sticking to just <-

= is no longer used, in favour of <-

movAv

I think this starting chunk would be clearer if only if and not else is used.

The stop error can move to the top of this, so we clearly capture if align is not "center" or "right" or "left". This makes it easier to understand the conditions of error.

if (align == "center") {
     before <- floor((w - 1) / 2)
     after  <- ceiling((w - 1) / 2)
} else if (align == "right") {
     before <- w - 1
     after  <- 0
} else if (align == "left") {
     before <- 0
     after  <- w - 1
} else {
     stop("'align' should be 'center', 'left' or 'right'")
}
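One way to put the validation first, sketched in base R: match.arg() rejects anything other than the three allowed values before any work is done, and switch() then selects the offsets without a chain of else branches. The helper name `window_offsets()` is made up for this example and is not part of the package:

```r
# Hypothetical helper: number of values before/after position i that
# fall inside a moving window of width w.
window_offsets <- function(w, align = c("center", "left", "right")) {
  align <- match.arg(align)  # errors early on any other value
  switch(align,
    center = c(before = floor((w - 1) / 2), after = ceiling((w - 1) / 2)),
    right  = c(before = w - 1, after = 0),
    left   = c(before = 0, after = w - 1)
  )
}

centered <- window_offsets(5, "center")  # before = 2, after = 2
trailing <- window_offsets(4, "right")   # before = 3, after = 0
```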

I suggest using another explaining variable inside mean:

mean(x[max(0, (i - before)):(i + after)], na.rm = TRUE)

## to something like:

earliest_to_latest <- x[max(0, (i - before)):(i + after)]
mean(earliest_to_latest, na.rm = TRUE)

## or given that this is repeated later
## potentially write this up as a function for reuse?
mean_earliest_latest(x, i, before, after)

As that mean statement is a bit involved to unfurl.
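A possible shape for that helper, as a sketch under the assumption that the window should simply be clipped at both ends of the series. Clipping gives the same result as the original expression, because R silently drops a 0 index and `na.rm = TRUE` discards the NA values produced by indexing past the end:

```r
# Hypothetical helper: mean of x over the window around position i,
# clipped to the valid index range of x.
mean_earliest_latest <- function(x, i, before, after) {
  earliest <- max(1, i - before)
  latest <- min(length(x), i + after)
  mean(x[earliest:latest], na.rm = TRUE)
}

ring_widths <- c(1, 2, 3, 4, 5)
at_start <- mean_earliest_latest(ring_widths, i = 1, before = 2, after = 2)
in_middle <- mean_earliest_latest(ring_widths, i = 3, before = 1, after = 1)
```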

Similarly, the pattern, if (edges == "fill") { and } else if (edges == "nofill") { should be bundled up into a function and applied with switch

Checks for edges and fill are now on top of the script. Else statements have been avoided.

read_fh.R

        # NEW: verbose = TRUE, header = FALSE
        inp <- readLines(fname, ok = TRUE, warn = FALSE)
        # NEW: removes empty lines in .fh file
        inp <- inp[nchar(inp) != 0]
        ## Get start and end positions of headers and data blocks
        header.begin <- grep("^HEADER:$", inp)
        # NEW: Quadro => chrono
        # NEW: Double => half chrono
lengths <- numeric(n) # commit Ronald Visser

I have found that moving comments either into documentation or into issues to help track them is helpful, but I appreciate that sometimes it is best to leave them in the code, but just something that might be worth thinking about :)

Tidying up the error messages in this function would make some of these nested if/else clauses easier to understand.

This is a pretty massive function, a bit over 1200 lines of code. I would recommend breaking down the steps inside this into smaller functions, as this will make the code easier to reason with and maintain in the future.

In the fellingdateR package I build upon the original code of the read.fh function from the dplR package. I would prefer to stay as close as possible to the original code in the dplR package in order to facilitate future cooperation and possible integration of both functions.

I removed all unnecessary comments as they were highlighting sections where I've made changes to the original code.

dplR::read.fh() concentrates on extracting the measurement data. The fellingdateR::read_fh() function extracts also the descriptive (meta-)data from the HEADER fields in a .fh file. This is not possible with the dplR::read.fh function.

Furthermore, the fellingdateR::read_fh() function can read data in CHRONO or HALF-CHRONO format.

read.fh() also throws errors when header fields include capital letters (this depends on the software used to produce the .fh files: TSAP, PAST, ...). read_fh() is case-insensitive.

sw_combin_plot.R

This is the first time I've seen ############ comment blocks - I'm all for stylistic choices but I am not sure this is needed, especially if this isn't used in other functions.

comment blocks with #### removed

I've not seen this pattern to avoid R CMD Check notes before

   # to avoid notes in CMD check
   year <-
      p <-
      lower <-
      upper <- COMB <- last <- n_sapwood <- A_i <- agreement <- NULL

My tactic has always been to have a separate definition of these, as answered by Carson Sievert on the Posit Community page. I don't think there's anything inherently wrong with the NULL assignment, but I could imagine that in some cases it could accidentally erase inputs. Something to be aware of, perhaps?
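For reference, the "separate definition" usually means a single call to utils::globalVariables(), often kept in its own file. A sketch, reusing the variable names from the snippet above (where the call would live in the package is an assumption):

```r
# Declares the names used in non-standard evaluation so that R CMD check
# does not report "no visible binding for global variable".
# Unlike `year <- NULL` inside a function, this never overwrites a
# function argument that happens to share one of these names.
utils::globalVariables(
  c("year", "p", "lower", "upper", "COMB",
    "last", "n_sapwood", "A_i", "agreement")
)

# Calling it without names returns what has been declared so far
declared <- utils::globalVariables()
```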

When I run devtools::check() I get

❯ checking R code for possible problems ... NOTE  
x: no visible binding for global variable ‘p’ 

Assigning NULL to these variables avoids the notes, as described in R Packages (2e): https://r-pkgs.org/package-within.html#delta-a-failed-attempt-at-making-a-package

I am all for using the new base R pipe |> - however you need to update your Depends in your DESCRIPTION like so in order to use it, since it only came out in R 4.1.0:

Depends: 
    R (>= 4.1.0)

This comment should probably live in a github issue or just be removed:

      # NEXT LINE TRIGGERS WARNING
      # Warning message:
      # Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
      # ℹ Please use one dimensional logical vectors instead.
      # ℹ The deprecated feature was likely used in the fellingdateR package.
      # Please report the issue to the authors.
      # { if (nrow(summary |> dplyr::filter(agreement == "poor")) != 0)
      # replaced by:

these comments are removed

sw_combine.R

This error should check each of the conditions separately - either it has missing values, or it is not numeric.

if (any(is.na(endDate)) | !is.numeric(endDate)) {
     stop(
          "--> Please check the column with 'end dates'.
Some values are possibly missing or the values are not numeric"
     )
}
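Split into two checks, the user learns which of the two problems occurred. A base R sketch; the helper name `check_end_dates()` and the messages are assumptions, and this is not the check_input() function mentioned in the reply below:

```r
# Hypothetical helper: reports a non-numeric column and missing values
# as two separate errors.
check_end_dates <- function(end_date) {
  if (!is.numeric(end_date)) {
    stop("The column with 'end dates' is not numeric.", call. = FALSE)
  }
  if (anyNA(end_date)) {
    stop(
      "The column with 'end dates' has missing values at position(s): ",
      paste(which(is.na(end_date)), collapse = ", "),
      call. = FALSE
    )
  }
  invisible(end_date)
}

not_numeric <- tryCatch(check_end_dates(c("1500", "1510")),
                        error = function(e) conditionMessage(e))
has_missing <- tryCatch(check_end_dates(c(1500, NA, 1510, NA)),
                        error = function(e) conditionMessage(e))
```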

A check_input() function (in helper-functions.R) now takes care of the input

sw_data_info.R

I think these error messages would benefit from using cli, as discussed above.

sw_data_overview.R

This is a nice function to include to facilitate data discovery

sw_interval_plot.R

This code

if (all(
     !(attributes(x)$names) %in% c(
          "year",
          "n_sapwood",
          "p")
))
     stop("Input differs from output sw_interval()")

Could be rewritten as an error function or the condition in if could be expressed as a function.
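A sketch of such an error function in base R. Note one deliberate difference, which is an assumption about the intent: the condition quoted above only fails when none of the three names is present, whereas this version requires all three and names the ones that are missing. The helper name is made up for the example:

```r
# Hypothetical helper: errors unless x has every column that
# sw_interval() is expected to return.
check_sw_interval_output <- function(x) {
  expected <- c("year", "n_sapwood", "p")
  missing_cols <- setdiff(expected, names(x))
  if (length(missing_cols) > 0) {
    stop(
      "Input differs from output sw_interval(); missing column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(x)
}

complete <- data.frame(year = 1500, n_sapwood = 10, p = 0.05)
incomplete_message <- tryCatch(
  check_sw_interval_output(data.frame(year = 1500)),
  error = function(e) conditionMessage(e)
)
```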

sw_interval.R

In the final line of documentation for this function there is a hanging sentence:

#' @return Depends on the value of `hdi`.
#'
#'  * If `hdi = TRUE`, a `numeric vector` reporting the upper and lower limit
#'   of the hdi (attributes provide more detail on `credMass` and the applied
#'   sapwood model (`sw_data`)).
#'  * If `hdi = FALSE`, a `matrix` with scaled p values for each number of
#'   observed sapwood rings. This matrix

Well spotted! Corrected.

sw_model.R

Great to see input checking at the top of the function - I do think these should be rewritten as check input functions.

Helper function d.count I think should be put into a separate R file called utils.R or helpers.R

d.count should use switch pattern and pass functions rather than using if controls.

d.count should be d_count

check_input() and d_dens() (instead of d.count) are now part of helper-functions.R

sw_sum_plot.R

indentation in this code is not consistent - recommend applying a style guide.

Examples should show different variations possible for function arguments. E.g., bar_col, spline_col, dot_col, and dot_size should all be specified in the examples so the user can see what the input should/could be.

examples have been updated with more visibility for the different parameters.

sw_sum.R

See note above on including plots.

tests

ldecicco-USGS commented 3 months ago

I think you need to paste this as a reply to this issue: https://github.com/ropensci/software-review/issues/618

hanecakr commented 3 months ago

I think you need to paste this as a reply to this issue: #618

oops! Can I remove this issue?

ldecicco-USGS commented 3 months ago

I'll just close it, no worries!