Code review of epicontacts

zkamvar commented 5 years ago

This is based on the rOpenSci code reviews: https://devguide.ropensci.org/

@zkamvar and @finlaycampbell are assigned.

zkamvar commented 5 years ago

Sorry, the review process starts here: https://devguide.ropensci.org/reviewerguide.html

finlaycampbell commented 5 years ago

I was thinking it might be best to conduct the code review on the tree branch? It's up to date with master and largely consists of the new functions vis_ttree, vis_ggplot and a bunch of helper functions. The big changes to the existing code base are in vis_epicontacts (to ensure consistency with the other plotting functions) and other plotting functions - I can look over all of these. All of the remaining functionality (clustering, etc.) aren't changed. Thoughts?

zkamvar commented 5 years ago

I don't think that would be a good idea because I think ttree needs its own thorough code review in addition to the current core functionalities of the master branch. It's better to nail down the core functionalities first so that we aren't chasing down bugs while considering new features.

finlaycampbell commented 5 years ago

OK fair enough let's stick with master for now then!

zkamvar commented 5 years ago

Here is the results of goodpractice:

> goodpractice::gp()                                                                                                                                                                            
Registered S3 method overwritten by 'httr':
  method                 from
  as.character.form_file crul
Preparing: covr
Preparing: cyclocomp
✔  checking for file ‘/tmp/RtmpHcSv9P/remotes442d27d9deba/epicontacts/DESCRIPTION’ ...
─  preparing ‘epicontacts’:
✔  checking DESCRIPTION meta-information
✔  checking vignette meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘epicontacts_1.1.1.tar.gz’

* installing *source* package ‘epicontacts’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (epicontacts)
Adding ‘epicontacts_1.1.1_R_x86_64-pc-linux-gnu.tar.gz’ to the cache
Preparing: description
Preparing: lintr
Preparing: namespace
Preparing: rcmdcheck
── GP epicontacts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 86% of code lines are covered by test cases.

    R/colors.R:64:NA
    R/colors.R:65:NA
    R/colors.R:66:NA
    R/colors.R:67:NA
    R/internals.R:36:NA
    ... and 87 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date will be added to the package when you perform `R CMD build` on
    it.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if you use
    '<-'.

    R/graph3D.R:111:23
    R/subset_clusters_by_size.R:68:16
    R/subset_clusters_by_size.R:69:16
    R/subset_clusters_by_size.R:75:16
    R/subset_clusters_by_size.R:81:16

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than
    80 characters

    R/get_clusters.R:3:1
    R/get_clusters.R:14:1
    R/get_clusters.R:15:1
    R/get_clusters.R:18:1
    R/get_clusters.R:24:1
    ... and 31 more lines

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R/graph3D.R:146:17

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error prone and result 1:0 if the expression on the right
    hand side is zero. Use seq_len() or seq_along() instead.

    tests/testthat/test_get_clusters.R:115:30
    tests/testthat/test_get_clusters.R:136:30

  ✖ fix this R CMD check NOTE: Namespace in Imports field not imported from: ‘colorspace’ All declared Imports should be used.

zkamvar commented 5 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Review Comments

README

Has no statement of need for the package
Only one of the three vignettes listed in the README works
Maintainer email is only partial

Documentation

get_pairwise() does not describe what it returns. Thus, it's not clear how the serial interval is derived in the first example unless you look at the code.
[.epicontacts: x[k = FALSE, j = FALSE] removes everything but the ids in the linelist... I don't think this is expected
thin() documentation doesn't explicitly say what it returns. Additionally, there are no examples.
summary.epicontacts() has documentation, but no examples and it doesn't describe what is summarized
as.igraph() doesn't work if igraph is not loaded.
get_clusters() this needs to describe how the clusters are obtained.

Tests

I see the ebola_sim linelist used quite a lot. It would be good to define a small test data set with some known pathologies (contacts without corresponding linelist components, etc) that we can use for testing known quantities. We also should get rid of the helper that auto-loads outbreaks. A lot of the tests additionally seem to repeat a lot of the data construction steps (I believe because they are set up as part of skip_on_cran(). There is no need to forbid all tests on CRAN. Some of them are okay to have.

as.igraph.epicontacts(): OK
colors: OK
clusters: the first test of as.igraph.epicontacts is unnecessary. I am unsure of what the second test is doing (what is a net node?). Neither of the get_clusters() tests actually test that the clusters are cromulent.
get_degree(): first test should read "both equals sum of in and out".
get_id(): I think these first tests need some explanation as to what is going on with the validation.
get_pairwise(): The first test needs to be updated to testthat standards. The "different types of attributes" tests need clarification on what they are testing for
graph3D(): This takes a long time to run. Use a smaller data set
handling: This needs more stress tests and more descriptive tests. A custom test handler might be useful here.
make_epicontacts: The first test needs to have more description to confirm that the process actually works
plot.epicontacts(): First test has the expectations commented out. Testing plots are always a bit of a PITA. This could benefit from a smaller dataset. Additionally, you can test against the internal workings of the resulting plot. The error expectation needs a specific error message to test against.
print.epicontacts(): This is effectively useless. It's not tested on travis or CRAN.
subset(): last bit needs to move away from the stored objects. The last test suite should be split up into smaller chunks because it goes all over the place.
subset_clusters_by_x: These need to be more specific and targeted. The clustering bit is the only thing that may be uncertain, so we need to make sure we have a data set that can actually test these.
summary(): Needs to get rid of the stored object comparison and hard-code the expected values.
thin(): OK
vis_epicontacts(): see plot.epicontacts()

finlaycampbell commented 5 years ago

I've implemented some changes on a code_review branch (see ea484306d8d582f2a11b6c65f9c256c10bca9e47)

Code

as.igraph.epicontacts

dplyr dependency?
merge(x$linelist, all_ids, by = 'id', all = TRUE)

colors

OK

get_clusters

dplyr dependency?

get_degree

No return value
Doesn’t work correctly for only_linelist = FALSE
change ‘contacts’ to ‘all’ on L48
vapply counting approach is very slow – perhaps switch to table

get_id

OK

get_pairwise

No return value
Description does not specify default behaviours for different classes
Vector is named in some cases (attribute is numeric) and not in others (attribute is character)
Fixed in commit
Could perhaps add a rev argument that calculates to → from
@param attribute should specify that you can add a character or the index of the column

graph3D

dplyr dependency
Changed to merge in commit

thin

Typo in warning message (contacts → contacts)

handling

No return value specified
Throws error when subsetting by date

internals

OK

make_epicontacts

IDs of class Date should not be allowed
Should NA IDs be allowed?

plot

The default qualitative color palette doesn’t really make sense for continuous node attributes, not a massive issue though

print.epicontacts

Change dplyr::tbl_df → tibble::as_tibble for printing?

print.summary_epicontacts

OK

subset_clusters_by_id

Description doesn’t node that linelist is also subsetted

subset_clusters_by_size

OK

subset.epicontacts

No return value
Maybe should be able to set a range for numeric attributes

summary.epicontacts

OK

thin

No return value
The example doesn’t seem to be using thin?

vis_epicontacts

Looks good for now but will have to looked over again when merging ttree

Tests

test_as.igraph.epicontacts

OK

test_colors

OK

test_get_clusters

“construction of net nodes” doesn’t really test for much AFAICT
I’ve added a test that manually constructs clusters and tests for them

test_get_degree

Added a name test for only_linelist = FALSE
Added a manual test for degree numbers calculation (this method is actually much faster than the current implementation – maybe should be swapped)

test_get_ID

OK

test_get_pairwise

No test on values
Added in commit

test_handling

Test for erroneous subsetting by Date

test_make_epicontacts

OK

test_plot.R

Testing against references not very useful

test_subset_clusters_by_ids

Added manual test with known clusters

test_subset_clusters_by_size

OK

test_subset.epicontacts

OK

test_summary.epicontacts

OK

test_thin

OK

test_vis_epicontacts

Testing against references not very useful
More tests trying to break the underlying machinery could be useful edge_color, edge_label, edge_width, node_color against numeric, character, factor, date attributes with NAs

reconhub / epicontacts