ropensci / rdhs

API Client and Data Munging for the Demographic and Health Survey Data
https://docs.ropensci.org/rdhs
Other
34 stars 10 forks source link

rdhs: submission for Ropensci #21

Closed OJWatson closed 6 years ago

OJWatson commented 6 years ago

Summary

It interacts with the Demographic and Health Survey (DHS) Program API (https://api.dhsprogram.com), and provides tools to use the API to ease identifying, downloading, loading and analysing the raw survey data collected by the DHS.

Package: rdhs
Type: Package
Title: API Client and Dataset Management for the Demographic and Health Survey (DHS) Data
Version: 0.3.0
Authors@R: c(
  person("OJ", "Watson", role=c("aut", "cre"),
         email="o.watson15@imperial.ac.uk"),
  person("Jeff", "Eaton", role="aut"))
Maintainer: OJ Watson <o.watson15@imperial.ac.uk>
URL: https://ojwatson.github.io/rdhs/
BugReports: https://github.com/OJWatson/rdhs/issues
Description: Provides a client for (1) querying the DHS API for survey indicators
  and metadata (https://api.dhsprogram.com/#/index.html), (2) identifying surveys
  and datasets for analysis, (3) downloading survey datasets from the DHS website,
  (4) loading datasets and associate metadata into R, and (5) extracting variables
  and combining datasets for pooled analysis.
LazyData: TRUE
Depends: R (>= 3.3.0)
Imports: 
    R6,
    httr,
    jsonlite,
    foreign,
    magrittr,
    rappdirs,
    digest,
    storr,
    xml2,
    data.table,
    qdapRegex,
    rgdal,
    haven,
    iotools
Suggests:
    testthat,
    knitr,
    rmarkdown,
    dplyr,
    ggplot2,
    survey,
    devtools,
    microbenchmark
Remotes:
    tidyverse/haven
License: MIT + file LICENSE
RoxygenNote: 6.0.1
VignetteBuilder: knitr

https://github.com/OJWatson/rdhs

Global Health Researchers and Policy makers. The DHS data has been used in well over 20,000 academic studies (based on google scholar search for "DHS" AND "demographic and health survey") that have helped shape progress towards targets such as the Sustainable Development Goals and inform health policy such as detailing trends in child mortality and characterising the distribution and use of insecticide-treated bed nets in Africa. The package will help assist researchers who use R for these purposes rather than/don't have access to stata/sas (these datasets are the published datasets by the DHS program), as well as serve to simplify commonly required analytical pipelines. The end result aims to increase the end user accessibility to the raw data and create a tool that supports reproducible global health research.

There are a number of other R pacakges that work with DHS data in various ways. A quick search of github for "DHS" and R shows 39 repos, however the majority are small custom scripts.

1 repo looks just at interacting with the DHS API, but it hasn't been added to for almost a year, and the API endpoint functions do not cover all the endpoints available nor allow you to query each endpoint by all the possible query terms. It also requires the user to know query terms rather than having them as arguments.

1 repo also looks at downloading the survey datasets from the website (and it was used initially when designing these fucntions with rdhs). However, it skips over large dataset files, has some bugs depending on the character length of your login credentials, and does not allow you to read in all the datasets available from the website. [ FYI: we don't read in .sas7bdat (we are writing a parser for the oddly formed catalog files provided by the DHS website for these) or hierarchal dataset files as we have a parser for the flat equivalent of hierarchal dataset. In theory each file format should be the same data, so having one parser that works is sufficient, but we have found that the flat and spss data formats have the most complete meta data for the data variable labels).

There are then a few repos that do bespoke pieces of analysis (2 of which are on CRAN) looking at spatial analysis and calculating survey statistics. We are hoping to bring these onboard, either by wrapping them to use the output of our downloaded harmonised datasets, or by writing additional tools for downstream analysis (see TODO.md).

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

Yes:

R CMD check results
0 errors | 0 warnings | 0 notes
richfitz commented 6 years ago

I'm going to do this in 1 hour chunks to avoid diminishing returns. This round does nothing substantive.

rdhs_startup_message <- function(...) {
  if (!getOption("rdhs.startup.quiet", FALSE)) {
    packageStartupMessage(...)
  }
}

so that they're easy to prevent.

Validation

This caught my eye:

check_client <- function(client){

  # if it's null then return early
  if(is.null(client)) return(FALSE)

  # and if this is somehow not a client object we know return FALSE
  if(!identical(class(client),c("client_dhs","R6"))) return(FALSE)

  cred_path <- client$.__enclos_env__$private$credentials_path
  root_path <- client$get_root()

  # check for a client here
  if(root_path == "" | cred_path == "") return(FALSE)

    # check the credentials file we have for them is still valid
    credentials <- normalizePath(cred_path,winslash="/", mustWork = FALSE)

    if(!file.exists(credentials)) return(FALSE)

    # and check if it is still valid
    out <- tryCatch(expr = read_credentials(credentials),error=function(e) { NULL })

    # and return now if that errored
    if(is.null(out)) return(FALSE)

    # check the storr database is valid
    out <- inherits(client$.__enclos_env__$private$storr,"storr")

    # and check if it can be used
    out <- tryCatch(expr = client$.__enclos_env__$private$storr$set("dummy",value = 1),error=function(e) { NULL })

    # and return now if that errored
    if(is.null(out)) return(FALSE)

    ## if we have got this far it's probably good
    return(TRUE)
}

This code tests things that I don't think need testing. For example:

  # if it's null then return early
  if(is.null(client)) return(FALSE)

  # and if this is somehow not a client object we know return FALSE
  if(!identical(class(client),c("client_dhs","R6"))) return(FALSE)

should be replaced by:

  if (!inherits(client), "client_dhs") {
    return(FALSE)
  }

I can elaborate further if this is not clear.

The other bits would be best done by a method in the class validate() that would avoid all the use of R6 internals.

In general, things like

if(root_path == "" | cred_path == "") return(FALSE)

should use the || operator (rule of thumb - if it's in if you mean || or && not the vectorised versions). But looking in the code it's not clear why the root would ever be the empty string - should you not validate this on the way in? So that the client is always ok?

This bit

    # check the storr database is valid
    out <- inherits(client$.__enclos_env__$private$storr,"storr")

Is also unnecessary - you have an object that you control the constructor. If the user has messed about and changed the storr object within the private members that's not your problem! I'd avoid writing in the dummy data in the next section too.

Good practice reports some easily fixed things:

It is good practice to

✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters

R/API.R:29:1
R/API.R:49:1
R/API.R:55:1
R/API.R:76:1
R/API.R:78:1
... and 413 more lines

✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

R/extraction.R:47:59
R/rbind_labelled.R:93:14
R/rbind_labelled.R:113:14
R/rbind_labelled.R:126:33
R/rbind_labelled.R:127:19
... and 7 more lines

✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE' by default, but are not reserved words and hence can be overwritten by the user. Hence, one should always use 'TRUE' and 'FALSE' for the logicals.

R/authentication.R:NA:NA
R/client.R:NA:NA
R/read_datasets.R:NA:NA
R/read_datasets.R:NA:NA

and

Standard form for licence

LICENSE setup is nonstandard - typically you would just have

```
YEAR: 2018
COPYRIGHT HOLDER: OJ Watson
```

This is poorly documented in the official docs but does get a writeup somewhere.  I would have thought that it R CMD check would have complained about the way you have it now actually.
OJWatson commented 6 years ago

Thanks again for this Rich - i've gone through these and addressed them in the latest version #30 , and have responded to the comments above in italics.

OJWatson commented 6 years ago

incorporated