Summary

What does this package do? (explain in 50 words or less):

It interacts with the Demographic and Health Survey (DHS) Program API (https://api.dhsprogram.com), and provides tools to use the API to ease identifying, downloading, loading and analysing the raw survey data collected by the DHS.

Paste the full DESCRIPTION file inside a code block below:

Package: rdhs
Type: Package
Title: API Client and Dataset Management for the Demographic and Health Survey (DHS) Data
Version: 0.3.0
Authors@R: c(
  person("OJ", "Watson", role=c("aut", "cre"),
         email="o.watson15@imperial.ac.uk"),
  person("Jeff", "Eaton", role="aut"))
Maintainer: OJ Watson <o.watson15@imperial.ac.uk>
URL: https://ojwatson.github.io/rdhs/
BugReports: https://github.com/OJWatson/rdhs/issues
Description: Provides a client for (1) querying the DHS API for survey indicators
  and metadata (https://api.dhsprogram.com/#/index.html), (2) identifying surveys
  and datasets for analysis, (3) downloading survey datasets from the DHS website,
  (4) loading datasets and associate metadata into R, and (5) extracting variables
  and combining datasets for pooled analysis.
LazyData: TRUE
Depends: R (>= 3.3.0)
Imports: 
    R6,
    httr,
    jsonlite,
    foreign,
    magrittr,
    rappdirs,
    digest,
    storr,
    xml2,
    data.table,
    qdapRegex,
    rgdal,
    haven,
    iotools
Suggests:
    testthat,
    knitr,
    rmarkdown,
    dplyr,
    ggplot2,
    survey,
    devtools,
    microbenchmark
Remotes:
    tidyverse/haven
License: MIT + file LICENSE
RoxygenNote: 6.0.1
VignetteBuilder: knitr

URL for the package (the development repository, not a stylized html page):

https://github.com/OJWatson/rdhs

Please indicate which category or categories from our package fit policies this package falls under *and why*? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
- data access, because it interacts with the DHS API
- data retrieval, because it helps download and load the raw datasets from the DHS website, after using the API to first identify those survey datasets relevant for your analysis
Who is the target audience and what are scientific applications of this package?

Global health researchers and policy makers. The DHS data have been used in well over 20,000 academic studies (based on a Google Scholar search for "DHS" AND "demographic and health survey") that have helped shape progress towards targets such as the Sustainable Development Goals and inform health policy, for example by detailing trends in child mortality and characterising the distribution and use of insecticide-treated bed nets in Africa. The package will assist researchers who use R for these purposes, or who do not have access to Stata or SAS (the DHS Program publishes its datasets in these formats), and will simplify commonly required analytical pipelines. The aim is to make the raw data more accessible to end users and to provide a tool that supports reproducible global health research.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

There are a number of other R packages that work with DHS data in various ways. A quick search of GitHub for "DHS" and R shows 39 repos; however, the majority are small custom scripts.

One repo looks only at interacting with the DHS API, but it hasn't been updated for almost a year, and its API endpoint functions neither cover all the available endpoints nor allow you to query each endpoint by all the possible query terms. It also requires the user to know the query terms rather than exposing them as arguments.

One repo also looks at downloading the survey datasets from the website (and it was used initially when designing these functions in rdhs). However, it skips over large dataset files, has some bugs depending on the character length of your login credentials, and does not allow you to read in all the datasets available from the website. [FYI: we don't read in .sas7bdat files (we are writing a parser for the oddly formed catalog files provided by the DHS website for these) or hierarchical dataset files, as we have a parser for the flat equivalent of the hierarchical dataset. In theory each file format should contain the same data, so having one parser that works is sufficient, but we have found that the flat and SPSS data formats have the most complete metadata for the data variable labels.]

There are then a few repos that do bespoke pieces of analysis (2 of which are on CRAN) looking at spatial analysis and calculating survey statistics. We are hoping to bring these onboard, either by wrapping them to use the output of our downloaded harmonised datasets, or by writing additional tools for downstream analysis (see TODO.md).

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
[x] I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

[x] Do you intend for this package to go on CRAN?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:
- [ ] The package has an obvious research application according to JOSS's definition.
- [ ] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- [ ] The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

[x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Yes:

R CMD check results
0 errors | 0 warnings | 0 notes

[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

I'm going to do this in 1 hour chunks to avoid diminishing returns. This round does nothing substantive.

you may not be able to use the DHS logo without clearing it with them - it may be taken as implying endorsement/association.
- I asked them and they said fine for now.
in the opening of the readme, I'd move the "Motivation" section higher because the enumerated list does not make much sense to a naive user without it.
- Done
Remove startup message
- Done
Style & structure
- the whole package would benefit from a style overhaul. This is boring, but really useful if you want people to contribute to the package. You might get away with running the whole thing through a code reformatter! But look at the output of lintr::lint_package() to get started. This will make a review much easier
- inconsistent spacing around arguments
- inconsistent spacing around = (e.g. normalizePath(path,winslash="/", mustWork = FALSE))
- inconsistent spacing around function definitions
- inconsistent use of newlines around if/else blocks
- inconsistent casing (e.g., set_rdhs_CREDENTIALS_PATH)
- there are functions here that are rather long and have a lot of logic - particularly with R6 methods I find it more maintainable to have very short methods and to use free functions for anything with significant logic.
- All the style things have been addressed. I have left the casing in that example as it is, because it matches the environment variable it is setting, which I think is meant to be in caps, right?
Dependencies: this is quite a large list of strong dependencies. It would be good to comb through this and work out which are really needed.
- why return data.table? this can surely be done by users if that's the framework that they are used to working in? The only other use of the package seems to be data.table::rbindlist which is easily implemented (albeit slower - but I'm not sure that will be a major consideration)
- Jeff wanted data.table.... I went through these before I sent it over and trimmed it a bit, but data.table is now also gone. I have also put in something here (https://github.com/OJWatson/rdhs/blob/patching/R/API.R#L73) to allow those who want data.table to easily have that without actually having to have it in Imports.
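One way to keep data.table out of Imports while still serving users who prefer it is to check for the package at run time. The sketch below is only an illustration of that idea and is not necessarily what the linked code does: `bind_rows_base()`, `maybe_as_data_table()` and the option name `rdhs.use_data_table` are hypothetical.

```r
# Base-R stand-in for data.table::rbindlist(): row-bind a non-empty list of
# data frames that share the same columns.
bind_rows_base <- function(dfs) {
  out <- do.call(rbind, dfs)
  rownames(out) <- NULL
  out
}

# Convert to a data.table only when the user opts in AND the package is
# installed; otherwise return the plain data.frame untouched.
# The option name "rdhs.use_data_table" is hypothetical.
maybe_as_data_table <- function(df,
                                use_dt = getOption("rdhs.use_data_table",
                                                   FALSE)) {
  if (isTRUE(use_dt) && requireNamespace("data.table", quietly = TRUE)) {
    return(data.table::as.data.table(df))
  }
  df
}

surveys <- list(
  data.frame(country = "KE", year = 2014L, stringsAsFactors = FALSE),
  data.frame(country = "TZ", year = 2015L, stringsAsFactors = FALSE)
)
combined <- maybe_as_data_table(bind_rows_base(surveys), use_dt = FALSE)
```

With this shape, data.table would only need to be listed under Suggests.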
startup:
- I would not call .onAttach() from within .onLoad - instead I would factor out the bootstrapping code into its own (testable) function and call it from there. As it is I believe (but have not verified) that your code will be called twice - once when the package is loaded and a second time when it is attached.
- _Done. .onLoad is just gone now, and thus if you try and use a function without attaching the package that requires the package environment client this will be caught by check_for_client_
- I would move all your startup messages behind an option, like

rdhs_startup_message <- function(...) {
  if (!getOption("rdhs.startup.quiet", FALSE)) {
    packageStartupMessage(...)
  }
}

so that they're easy to prevent.
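As a rough illustration of how that option behaves, the sketch below repeats the helper and captures what it emits with the option off and on. The message text and the `capture_startup()` helper are made up for the example.

```r
rdhs_startup_message <- function(...) {
  if (!getOption("rdhs.startup.quiet", FALSE)) {
    packageStartupMessage(...)
  }
}

# Collect whatever startup messages a call emits, without printing them.
# (Hypothetical helper, only used for this demonstration.)
capture_startup <- function(expr) {
  seen <- character()
  withCallingHandlers(
    expr,
    packageStartupMessage = function(m) {
      seen <<- c(seen, conditionMessage(m))
      invokeRestart("muffleMessage")
    }
  )
  seen
}

old <- options(rdhs.startup.quiet = FALSE)
loud <- capture_startup(rdhs_startup_message("Welcome to rdhs"))

options(rdhs.startup.quiet = TRUE)
quiet <- capture_startup(rdhs_startup_message("Welcome to rdhs"))

options(old)  # restore the previous setting
```

Because the helper goes through packageStartupMessage(), wrapping library() in suppressPackageStartupMessages() also silences it.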

Done
I would be very wary about writing to a user's Renviron file. If you're going to do it then consider a package that does this for you, but possibly just write instructions. You also need to be careful not to fall foul of the CRAN policy on writing here. As an alternative, you might want to use a system where you configure per-directory .Renviron files (R reads the system one first, then the home directory, then the current working directory). Or do something like have an rhds.env file that you look for and avoid Renviron altogether.
- I have gone for making the user provide permission explicitly, via a prompt, for anything to be written to file outside of their tempdir. (How can you test code that has prompts, by the way...?) For the .Renviron editing I've grabbed the bits from the startup package to ensure that we get a cross-platform .Renviron path, but I couldn't find a package that handles writing to this.
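On the question of testing code that prompts: one common approach is to pass the prompting function in as an argument, so that tests can substitute a canned answer. A minimal sketch under that assumption follows; `confirm_write()` and `write_if_allowed()` are hypothetical names and not part of rdhs.

```r
# Ask for permission; `ask` defaults to readline() but can be swapped out.
confirm_write <- function(path, ask = readline) {
  answer <- ask(sprintf("Allow rdhs to write to '%s'? [y/N] ", path))
  tolower(trimws(answer)) %in% c("y", "yes")
}

# Write `lines` to `path` only if permission is granted.
write_if_allowed <- function(lines, path, ask = readline) {
  if (!confirm_write(path, ask)) {
    return(invisible(FALSE))
  }
  writeLines(lines, path)
  invisible(TRUE)
}

# In a test, replace the prompt with a function returning a fixed answer.
target <- file.path(tempdir(), "rdhs-example.env")
unlink(target)

refused <- write_if_allowed("RDHS_EXAMPLE=1", target,
                            ask = function(prompt) "n")
exists_after_refusal <- file.exists(target)

granted <- write_if_allowed("RDHS_EXAMPLE=1", target,
                            ask = function(prompt) "y")
```

The default answer is "no", so running non-interactively (where readline() returns an empty string) never writes anything.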

Validation

Yeah, it was a haphazard, over-the-top file. It has been largely trimmed down, and I've gone through and replaced the vectorised | and & where they are not actually needed.

This caught my eye:

check_client <- function(client){

  # if it's null then return early
  if(is.null(client)) return(FALSE)

  # and if this is somehow not a client object we know return FALSE
  if(!identical(class(client),c("client_dhs","R6"))) return(FALSE)

  cred_path <- client$.__enclos_env__$private$credentials_path
  root_path <- client$get_root()

  # check for a client here
  if(root_path == "" | cred_path == "") return(FALSE)

    # check the credentials file we have for them is still valid
    credentials <- normalizePath(cred_path,winslash="/", mustWork = FALSE)

    if(!file.exists(credentials)) return(FALSE)

    # and check if it is still valid
    out <- tryCatch(expr = read_credentials(credentials),error=function(e) { NULL })

    # and return now if that errored
    if(is.null(out)) return(FALSE)

    # check the storr database is valid
    out <- inherits(client$.__enclos_env__$private$storr,"storr")

    # and check if it can be used
    out <- tryCatch(expr = client$.__enclos_env__$private$storr$set("dummy",value = 1),error=function(e) { NULL })

    # and return now if that errored
    if(is.null(out)) return(FALSE)

    ## if we have got this far it's probably good
    return(TRUE)
}

This code tests things that I don't think need testing. For example:

  # if it's null then return early
  if(is.null(client)) return(FALSE)

  # and if this is somehow not a client object we know return FALSE
  if(!identical(class(client),c("client_dhs","R6"))) return(FALSE)

should be replaced by:

  if (!inherits(client, "client_dhs")) {
    return(FALSE)
  }

I can elaborate further if this is not clear.
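To illustrate why a single inherits() call is enough, the sketch below uses plain S3 stand-ins rather than real R6 objects (an assumption made only to keep the example dependency-free; `client_dhs_child` is a made-up subclass name). inherits() already returns FALSE for NULL, and unlike identical(class(x), ...) it keeps working if the class is ever subclassed.

```r
# Stand-ins carrying the same class vectors an R6 client would have.
plain <- structure(list(), class = c("client_dhs", "R6"))
child <- structure(list(), class = c("client_dhs_child", "client_dhs", "R6"))

strict_check <- function(x) identical(class(x), c("client_dhs", "R6"))
loose_check <- function(x) inherits(x, "client_dhs")

results <- c(
  null_loose   = loose_check(NULL),    # FALSE: no separate NULL test needed
  plain_strict = strict_check(plain),  # TRUE
  plain_loose  = loose_check(plain),   # TRUE
  child_strict = strict_check(child),  # FALSE: the exact match rejects it
  child_loose  = loose_check(child)    # TRUE
)
```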

The other bits would be best done by a method in the class validate() that would avoid all the use of R6 internals.

In general, things like

if(root_path == "" | cred_path == "") return(FALSE)

should use the || operator (rule of thumb - if it's in if you mean || or && not the vectorised versions). But looking in the code it's not clear why the root would ever be the empty string - should you not validate this on the way in? So that the client is always ok?
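A short sketch of both points: the first part shows the difference between the operators, and the second shows one possible way to validate paths on the way in so that later checks for empty strings become unnecessary. `check_path_arg()` is a hypothetical helper, not an rdhs function.

```r
# `||` stops as soon as the answer is known; `|` evaluates both sides.
short_circuit <- TRUE || stop("never evaluated")
eager <- tryCatch(TRUE | stop("evaluated"), error = function(e) "error")

# Hypothetical input check to run in the constructor: fail loudly on bad
# input instead of storing "" and testing for it later.
check_path_arg <- function(x, name) {
  if (!is.character(x) || length(x) != 1L || is.na(x) || !nzchar(x)) {
    stop(sprintf("'%s' must be a single non-empty string", name),
         call. = FALSE)
  }
  invisible(x)
}

ok <- check_path_arg("~/dhs-cache", "root")
bad <- tryCatch(check_path_arg("", "root"),
                error = function(e) conditionMessage(e))
```

The chained `||` conditions also rely on short-circuiting: `is.na(x)` is only reached once `x` is known to be a length-one character vector.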

This bit

    # check the storr database is valid
    out <- inherits(client$.__enclos_env__$private$storr,"storr")

Is also unnecessary - you control the constructor of this object. If the user has messed about and changed the storr object within the private members that's not your problem! I'd avoid writing in the dummy data in the next section too.

Good practice reports some easily fixed things:

Good practice now doesn't have any comments. lint mentions some upper case variables, but these only exist where they are the variables returned from the DHS API, as I didn't want to change that.

It is good practice to

✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters

R/API.R:29:1
R/API.R:49:1
R/API.R:55:1
R/API.R:76:1
R/API.R:78:1
... and 413 more lines

✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

R/extraction.R:47:59
R/rbind_labelled.R:93:14
R/rbind_labelled.R:113:14
R/rbind_labelled.R:126:33
R/rbind_labelled.R:127:19
... and 7 more lines
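A small illustration of the type-safety point, using made-up inputs: sapply() changes its return type with the data, while vapply() either returns the declared type or fails.

```r
empty <- list()

# On empty input sapply() falls back to a list ...
sapply_empty <- sapply(empty, length)
# ... whereas vapply() still returns the declared integer type.
vapply_empty <- vapply(empty, length, integer(1))

mixed <- list(1, "a")

# sapply() silently coerces mixed results to character.
sapply_mixed <- sapply(mixed, function(x) x)
# vapply() refuses, because "a" is not a numeric(1).
vapply_mixed <- tryCatch(
  vapply(mixed, function(x) x, numeric(1)),
  error = function(e) "error"
)
```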

✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE' by default, but are not reserved words and hence can be overwritten by the user. Hence, one should always use 'TRUE' and 'FALSE' for the logicals.

R/authentication.R:NA:NA
R/client.R:NA:NA
R/read_datasets.R:NA:NA
R/read_datasets.R:NA:NA
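The reason behind the last point can be seen in a few lines: `T` can be reassigned like any other variable, whereas `TRUE` cannot.

```r
# Perfectly legal: T is an ordinary variable, so this masks the default.
T <- 0
masked <- isTRUE(T)

# TRUE is a reserved word, so the same assignment fails when evaluated.
reserved <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "error"
)

rm(T)  # remove the mask so T resolves to the base value again
```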

and

Standard form for licence

LICENSE setup is nonstandard - typically you would just have

```
YEAR: 2018
COPYRIGHT HOLDER: OJ Watson
```

This is poorly documented in the official docs but does get a writeup somewhere.  I would have thought that R CMD check would have complained about the way you have it now actually.

Thanks again for this Rich - I've gone through these and addressed them in the latest version #30, and have responded to the comments above in italics.


rdhs: submission for Ropensci #21
