Questions about updating code

llrs commented 2 years ago

Some time ago a user asked me to make my package compatible with the upcoming RGCCA version https://github.com/llrs/inteRmodel/issues/4. I'm now looking into it as I'll probably be making this transition myself.

However, there are some difficulties when moving from RGCCA 2.1.2 to RGCCA >= 3. I have some questions about what will be kept and I think there are some minor modifications that could make it easier for other developers to adapt packages and scripts to the new version of RGCCA. Probably some of these changes that break the backwards compatibility are conscious decisions you made, so it might not be anything to do on your part but knowing them would help me (and probably others) to adapt the code.

Hope it is fine to ask and provide feedback here.

GFabien commented 2 years ago

Hi @llrs! I totally understand your point and I agree. I think it's a good idea to have this discussion here to document these changes. Generally, the changes/deletions were made to either simplify the code or give more meaningful names to objects.

llrs commented 2 years ago

This new release is a major release with many improvements, however it also breaks many functions that previously worked some way and now use a different interface.

NEWS:

The NEWS file on CRAN branch (inside the inst folder) says that it is a submission for Bioconductor and the DESCRIPTION now has biocViews (used only by Bioconductor or to be able to install packages from Bioconductor). Not sure if this is going to be submited to CRAN or Bioconductor but it doesn't use any of the classes of Bioconductor. It would be nice if it played well with Multi Assay Experiment objects if it is submitted to Bioconductor (if not too but it won't be required by the reviewers).

As this is a major release there are multiple breaking changes such as sgcca and rgcca are not compatible with the previous versions. It would be great place to add a notice and how do the different versions differ on the NEWS file.

NAMESPACE:

If a function is not meant to be used by external users it shouldn't be exported. I think now this happens for sgcca, scale2, cov2, tau.estimate.

DESCRIPTION:

There are now some expressions on the Description to link like ((\link{rgcca_predict})), I think these are not rendered/linked so they could be eliminated.

Many more dependencies on this release than on the previous one. While some of them are needed by the functionality some others I'm not so much convinced. Having many dependencies could make the package get archived if they fail to update (once it is on CRAN or Bioconductor) and makes it more difficult to install. scales is a direct dependency I only could find hue_pal imported but it was not used on the current form of the package.

Documentation

I had a difficult time understanding what do some functions and options do. load_blocks: it is not clear the format of the files. It doesn't use vectorized input: why use "file1,file2" instead of c("file1", "file2") ? rgcca rgccad: What is the difference of rgccad with rgcca, it seems like rgccad is an old version of rgcca.

On some examples the old argument notation of A is used while on others blocks is used, it would be nice if there were some agreement on the examples.

There are calls to internal functions on the vignetee (RGCCA:::tau.estimate) this usually indicates that the function should be exported (as it was previously). I would also like to have scale2 exported in order to be able to escale myself the data and then use it as input and interate with it with multiple models (instead of having to scale the data for each model).

On line 935 of the vignette there is a call to rgcca_stability from a rgcca object that fails because it is not produced with the sgcca method.

There was an error on chunk 57 of the vignette:

 Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
  is.atomic(x) is not TRUE

Not sure where it raises.

The shiny app points to https://github.com/rgcca-factory/RGCCA/blob/release/3.0.0/inst/shiny/tutorialShiny.md which currently don't have any tutorial of the app.

On the vignette there are many typos (ouput, ...)

There are several functions where the title match with the description, which doesn't help to understand what the function does. There can be two easy solutions: make the title shorter or explain what does the function with some more words on the description. Currently I found it hard to understand what do several functions like select_analysis, or what is the intended usage.

Code:

load_libraries force the installation of packages without respecting users options, such as repositories or libraries. This can cause troubles, make the download slower as it forces to use a mirror from CRAN away from the user and install it on a library the user might not have access to. I think it is customary when needing a package to just report that a package is need and let the user install it themselves.

Is there any way to obtain the data used on the plots but without the plot? This would help if the user wanted to have the same values but plot it differently (no error bars, different color...) without receiving a message which would help on packages.

scale3 function is quite similar to scale2 except for std[std == 0] = 1 but it is not used anywhere (except on a test).

On some cases on the plot_dynamic function there are some provisions in case the plotly is lower than 4.9. Probably it is better to just require a plotly version equal or greater than 4.9.

Nos sure if there is any way to call the shiny app from within R. I couldn't find a function calling it.

Similarly for the script on inst, I'm sure it will be more used if there is some notification to the user on the NEWS, readme or somewhere within the package documentation.

On bootstrap_k even if the method is nipals the pm function substitute all the NA values by 0. This affects both bootstrap and rgcca_stability.

I think it is safer to test which class an object is with is, so class(blocks)=="rgcca" would become is(blocks, "rgcca") (example from rgcca_cv). In the case that something is two classes at the same time the first code would fail (or provide a 2 element vector that couldn't be used on an if clause if not prepared) while the second one will only return a single logical value. This would help with objects of multiple classes (in case rgcca methods sgcca returns an object of sgcca and rgcca class)

There is a call to Sys.sleep(1) on rgcca_cv, I don't understand why it is needed, probably was left from testing it.

bootstrap uses intersection_list which removes the NA values (so it doesn't use the NA_method specified on the rgcca/sgcca call). Also there is no way to retrieve the index used to boot, so if I wanted to compare different models methods with the same bootstrap I couldn't do it (this is one of the things I implemented on my package).

remove_null_sd could use var instead of sd to make it faster as sd is a transformation of var.

There isn't a clear style of the package, sometimes the assignment operator <- and = are used or the spaces around = on arguments or %in%. Having a consistent style makes it easier to read, as well as consistently spacing on the arguments. I would recommend to run a styler to unify the style of the package.

Testing code:

There are many more tests, but most of them require visual/manual inspecting, automating it more with testthat or any other way would help (this is how I detected some changes).

I get an error when I run this code:

data("Russett", package = "RGCCA")
X_agric <- as.matrix(Russett[, c("gini", "farm", "rent")])
X_ind <- as.matrix(Russett[, c("gnpr", "labo")])
X_polit <- as.matrix(Russett[ , c("inst", "ecks",  "death", "demostab",
                                 "dictator")])
A <- list(X_agric, X_ind, X_polit)
A <- lapply(A, function(x) scale2(x, bias = TRUE))
C <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0), 3, 3)
out <- rgcca(A, C, tau =rep(0, 3), scheme = "factorial", method = "rgcca",
                   scale = TRUE, verbose = FALSE, ncomp = rep(2, length(A)))

I need to add a scale_block = TRUE in order to make it work. Not sure why this happens, either there is some problem with devtools loading the package, or there are some problems with the environmental calls. I also found that it is possible to specify two methods and it will break with

Error in if (tolower(method) %in% c("sgcca", "spca", "spls")) { : 
  the condition has length > 1

Overall I found much easier than expected to update my package to be compatible with the future release. I only had to intercept calls to sgcca and replace some arguments names. Looking forward to the updates of the package.

GFabien commented 1 year ago

Hi @llrs! The new version is now available on CRAN! We tried to take into account most of your comments. I'm closing this issue since most of the comments are no longer relevant. Thank you again for your feedback, and do not hesitate to open a new issue for anything.

llrs commented 1 year ago

Congratulations on the release. I'm sure it felt good to finally pass the CRAN checks. Thanks for letting me know. I might need to use RGCCA soon.

I see you went with completely removing the sgcca function but you kept some old arguments in rgcca. Thanks!

I've seen you had some problems with version 3.0.0 and the current 3.0.1 currently has some errors in the tests too. If you need to release a new version perhaps it would be a good move to document the print and plot methods for the rgcca, rgcca_cv, rgcca_permutation, rgcca_bootstrap, and rgcca_stability classes in a single page for all the classes.

rgcca-factory / RGCCA

Questions about updating code #35