malaria-atlas-project / malariaAtlas

An R interface to open-access malaria data, hosted by the Malaria Atlas Project.
https://malariajournal.biomedcentral.com/articles/10.1186/s12936-018-2500-5

Use rdhs for dhs data #30

Closed timcdlucas closed 5 years ago

timcdlucas commented 5 years ago

Hey,

I was sending someone who wanted to get all the data used for the malaria maps to this package and noticed the DHS coordinates were missing and then saw this issue :)

The following gets you very close to what you may want. I've started it in a fork, but there were a couple of dhs_ids I could not match correctly within the DHS surveys; these are commented in the code below.

Most of the function documentation is the same as that for rdhs::set_rdhs_config, which does the auth bits for you.

Let me know what you think/any ideas on the odd dhs_ids

Ta, OJ

#' Add DHS locations to malaria data
#'
#'
#' @inheritParams rdhs::as_factor
#' @param data Data to add DHS coordinates to
#' @examples 
#' 
#' pf <- malariaAtlas::getPR("all",species = "pf")
#' pf <- fillDHSCoordinates(pf, 
#' email = "rdhs.tester@gmail.com",
#' project = "Testing Malaria Investigations")

fillDHSCoordinates <- function(data,
                                email = NULL, project = NULL, 
                                cache_path = NULL, config_path = NULL, 
                                global = TRUE, verbose_download = FALSE, 
                                verbose_setup = TRUE, data_frame = NULL, 
                                timeout = 30, password_prompt = FALSE, 
                                prompt = TRUE) {

  # set up a config for rdhs
  rdhs::set_rdhs_config(email = email, project = project, cache_path = cache_path, config_path = config_path, 
    global = global, verbose_download = verbose_download, verbose_setup = verbose_setup, 
    data_frame = data_frame, timeout = timeout, password_prompt = password_prompt, 
    prompt = prompt)

  # get stems and remove blanks
  dhs_id_stems <- unique(substr(data$dhs_id, 1, 6))
  dhs_id_stems <- dhs_id_stems[nchar(dhs_id_stems)==6]

  # then there are some odd dhs ids I noticed
  dhs_id_stems[dhs_id_stems=="MDG201"] <- "MD2011"

  # I couldn't find the following ids in the datasets
  # dhs_id_stems[dhs_id_stems=="BI2012"] <- "BU2012"
  # dhs_id_stems[dhs_id_stems=="MZ2014"] <- "MZ2014"

  # find the necessary geographic data files from the DHS API
  dats <- rdhs::dhs_datasets(countryIds = unique(substr(dhs_id_stems, 1, 2)),
                             surveyYear = unique(substr(dhs_id_stems, 3, 6)),
                             fileType = "GE")
  dats <- dats[which(substr(dats$SurveyId, 1, 6) %in% dhs_id_stems),]

  # download the datasets
  geo <- rdhs::get_datasets(dats)
  no_permission <- "Dataset is not available with your DHS login credentials"
  # logical indexing: -which() would drop every element when nothing matches
  geo <- geo[unlist(geo) != no_permission]

  # missing info (can add more depending on factors, e.g. encoding of urban/rural)
  mis_info <- c("dhs_id","site_id", "latitude", "longitude")
  dhs_info <- c("DHSID","DHSCLUST", "LATNUM", "LONGNUM")

  # fill in blanks
  for(stem in dhs_id_stems) {

    # what file does the stem relate to
    file_name_match <- dats$FileName[which(substr(dats$SurveyId, 1, 6) == stem)]
    # strip the .zip/.ZIP extension to get the dataset name
    file_name <- sub("\\.zip$", "", file_name_match, ignore.case = TRUE)

    # did we find that file
    if (length(file_name)==1) {

      # read in the data and then fill in blanks
      shp <- readRDS(geo[[file_name]])@data
      matches <- match(shp$DHSID,data$dhs_id)

      data[na.omit(matches), mis_info] <- shp[which(!is.na(matches)), dhs_info]

    } 
  }

  return(data)

}

Originally posted by @OJWatson in https://github.com/malaria-atlas-project/malariaAtlas/issues/5#issuecomment-449117069
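
For anyone reading the "fill in blanks" step, here is a toy sketch of the match()-based assignment. The IDs and coordinates are made up and the two data frames are hypothetical stand-ins for the malaria data and a DHS shapefile table; only the column names follow the function above.

```r
# Toy stand-in for the malaria data: two rows with a DHS id, one without.
data <- data.frame(
  dhs_id    = c("AA201000000001", "", "AA201000000003"),
  site_id   = NA_real_,
  latitude  = NA_real_,
  longitude = NA_real_,
  stringsAsFactors = FALSE
)

# Toy stand-in for the @data slot of a DHS geographic file (made-up values).
shp <- data.frame(
  DHSID    = c("AA201000000003", "AA201000000001", "AA201000000002"),
  DHSCLUST = c(3, 1, 2),
  LATNUM   = c(-1.5, 2.0, 0.3),
  LONGNUM  = c(30.1, 31.2, 29.9),
  stringsAsFactors = FALSE
)

mis_info <- c("site_id", "latitude", "longitude")
dhs_info <- c("DHSCLUST", "LATNUM", "LONGNUM")

# For each cluster in shp, find the row of data with the same id (NA if none).
matches <- match(shp$DHSID, data$dhs_id)
found <- !is.na(matches)

# Copy the cluster number and coordinates into the matching rows of data.
data[matches[found], mis_info] <- shp[found, dhs_info]
print(data)
```

Note that match() only returns the first matching row, so this pattern assumes each dhs_id appears at most once in the data.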

timcdlucas commented 5 years ago

Fork is here. https://github.com/OJWatson/malariaAtlas/tree/master

timcdlucas commented 5 years ago

Hi @OJWatson,

I'm trying to get this to work and failing.

The line geo <- get_datasets(dats) gives me

Logging into DHS website...
Error in names(filedatatypelist_DHS) <- paste0("filedatatypelist_", qdapRegex::rm_between(filedatatypelist_DHS_line,  : 
  'names' attribute [1] must be the same length as the vector [0]

The only thing I could think was that perhaps it should have been geo <- get_datasets(dats$FileName) but that gave me the same error.

I logged in using my own email and project name. It seemed to work.

I started digging to work out what's wrong but it quickly got deep into stuff I had no idea about. Any ideas what's wrong?

Thanks in advance.

OJWatson commented 5 years ago

Hmm okay, so it seems to be erroring at the stage where rdhs goes to the Download Manager tab. A couple of things to try:

  1. With the login account that you have could you try logging in to the DHS website and then click on the Download Manager tab. This should take you to a page that looks something like this. Do you get this page?: image.
  2. If yes then you may need to give me a bit more information. Before running get_datasets(dats), could you run debug(rdhs:::available_datasets)? Then as you step through you'll reach the following lines:
  # Grab the content from that and start creation for last post request
  writeBin(z$content, tf)
  # load the text
  y <- readLines(tf, warn = FALSE)

Could you dump and upload what y looks like here. This should be the Download Manager web page, from which I grab all the selectable download options before making another POST request to create the url with all the download links available for your account. In grabbing the selectable options the error is thrown due to not finding any selectable options. So if you can see them in step 1, then this should let me know what's going on.
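
As a rough illustration of the failure described above (not the rdhs source itself), the same error message appears whenever a name is assigned to a zero-length vector. The sketch below also shows one way to save `y` to a file for uploading. The variable names and the stand-in value of `y` are hypothetical.

```r
# Assigning one name to a vector with no elements fails in the same way
# as the error reported above.
selectable_options <- character(0)  # stand-in for "no options found on the page"
msg <- tryCatch({
  names(selectable_options) <- paste0("filedatatypelist_", "example")
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Saving the page text to a file makes it easy to attach to an issue.
# `y` here is a stand-in; in the debugger it would be the readLines() result.
y <- c("<!DOCTYPE html>", "<html>", "</html>")
out_file <- file.path(tempdir(), "download_manager_page.html")
writeLines(y, out_file)
```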

Thanks again for trying it out and trying to get this to work,

All the best,

OJ

timcdlucas commented 5 years ago

This is going to turn into one of those things where it's me being a complete idiot... sorry if that's the case.

I can't see a download manager tab and ctrl + f isn't finding me anything similar.

I get to this page:

screenshot from 2019-02-15 11-11-47

and then choosing a region gets me to here:

screenshot from 2019-02-15 11-11-56

Which is all at https://dhsprogram.com/data/dataset_admin/index.cfm.

ps I don't think I'm accidentally posting screen shots of private information other than my email address. If you notice something can you let me know and I'll delete it...

OJWatson commented 5 years ago

okay this makes more sense. (and i don't think you're posting anything private).

So to access the DHS datasets, you first have to make the account with a project name and then request dataset access. So in that second screenshot, if you select all the datasets available, then in a day or so, once the DHS has approved your request, you should have a Download Manager available.

timcdlucas commented 5 years ago

Oh great thanks. I won't even count that as me being totally stupid.

I'll set that up and get back to you. I'll also make sure to document this carefully in this package. Given the sideways way I've started using rdhs I've never even read the docs so no idea if it's in there. But this perhaps highlights a useful place to put an informative error message.

Thanks again!

OJWatson commented 5 years ago

Hey, yeah agree there should be a message to flag this up. Will make an issue for this. Thanks and let me know how it goes once you have datasets access.
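
One possible shape for such a message, as a sketch only: a small wrapper that catches the cryptic 'names' error and re-throws it with a hint about dataset access. The wrapper name `with_dhs_access_hint` is hypothetical and is not part of rdhs.

```r
# Evaluate an expression and, if it fails with the zero-length 'names' error
# seen in this thread, replace the message with a hint about dataset access.
with_dhs_access_hint <- function(expr) {
  tryCatch(expr, error = function(e) {
    msg <- conditionMessage(e)
    if (grepl("'names' attribute", msg, fixed = TRUE)) {
      stop("Could not find any download options on the DHS Download Manager page. ",
           "Has dataset access been requested and approved for this account?\n",
           "Original error: ", msg, call. = FALSE)
    }
    stop(e)  # any other error is passed on unchanged
  })
}
```

It would be used as `geo <- with_dhs_access_hint(rdhs::get_datasets(dats))`.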



timcdlucas commented 5 years ago

Hi @OJWatson. Just to say sorry I'm being so slow with this. I didn't get it to work and didn't find time to work out why.

Can't remember if I said that I was on paternity leave for the last 6 months. I'm now back at work so maybe I'll find time soonish.

OJWatson commented 5 years ago

No worries at all and congrats. There is no rush at all from my end and i'm nearing the end of my PhD so side projects (like rdhs) I will also be slow responding to as well.

timcdlucas commented 5 years ago

Cheers and good luck with finishing the phd!

timcdlucas commented 5 years ago

Hi,

Starting finally to look at this.

I did sign up for all datasets. So my Download Manager page now looks like yours.

>   geo <- get_datasets(dats)
These requested datasets are not available from your DHS login credentials:
---
AOGE52FL.zip, AOGE61FL.ZIP, AOGE71FL.zip, BJGE61FL.ZIP, BFGE61FL.zip, BFGE71FL.zip, BUGE71FL.ZIP, CMGE61FL.zip, CDGE52FL.zip, CDGE61FL.zip, CIGE61FL.ZIP, GHGE71FL.zip, GHGE7AFL.zip, GNGE61FL.ZIP, KEGE7AFL.zip, LBGE5CFL.ZIP, LBGE61FL.ZIP, LBGE71FL.ZIP, MDGE61FL.ZIP, MDGE6AFL.zip, MDGE71FL.zip, MWGE71FL.zip, MWGE7IFL.ZIP, MLGE63FL.zip, MLGE71FL.zip, MZGE61FL.ZIP, NGGE61FL.ZIP, NGGE71FL.zip, RWGE5BFL.zip, RWGE61FL.ZIP, SNGE5AFL.zip, SNGE61FL.ZIP, SNGE6IFLSR.zip, SNGE6AFL.zip, SNGE71FLSR.ZIP, SNGE71FL.ZIP, SNGE7AFL.ZIP, SNGE7AFLSR.ZIP, SNGE7IFLSR.ZIP, SNGE7IFL.ZIP, SLGE71FL.ZIP, TZGE52FL.zip, TZGE6AFL.ZIP, TZGE7AFL.zip, TZGE7IFL.ZIP, TGGE62FL.zip, TGGE71FL.ZIP, UGGE5AFL.zip, UGGE71FL.zip, UGGE7AFL.ZIP
---
Please request permission for these datasets from the DHS website to be able to download them

So get_datasets now runs without errors but I just don't get any data back.

I've again tried to work out what is and isn't working, but I really don't even know how to approach it as so much of the stuff is internal.

debug(rdhs:::available_datasets)
geo <- get_datasets(dats)

This doesn't step through the function line by line or anything like that. Which I guess it should do. I've never used debug.

So I tried doing stuff like this:

  client <- rdhs:::.rdhs$client
  private <- client$.__enclos_env__$private

But I still get totally stuck. I got to the point where I was trying to run private$check_available_datasets(dataset_filenames) line by line, but I don't really understand where that is defined and it uses a bunch of other stuff like self that again I don't understand where that is or where it comes from.

So I'm afraid I'm stuck. Again, any help much appreciated!

timcdlucas commented 5 years ago

OOok. @Danpfeffer and @shk313 got this to work no problem and it turned out to be me being an idiot. I never requested the GPS data specifically. Works for me now.

So, I'll follow up on those funny study codes. Possibly just a copy error on our side or something. Then we'll pretty much just do some careful documentation, maybe add some errors reminding people (i.e. me) to request the GPS data and add it into the package. I'll leave the issue open until the functionality is fully merged into master.
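
A minimal sketch of what such a reminder could look like, assuming the DHS file naming seen in the list above, where characters 3 and 4 of the file name give the file type ("GE" for geographic data). Both helper names are hypothetical, and `available_files` is assumed to be a character vector of file names the account can download.

```r
# TRUE for file names whose type code (characters 3-4) is "GE".
is_geographic_file <- function(file_names) {
  substr(toupper(file_names), 3, 4) == "GE"
}

# Stop with a reminder if the account has no geographic datasets at all.
check_gps_access <- function(available_files) {
  if (!any(is_geographic_file(available_files))) {
    stop("No geographic (GE) datasets are available for this DHS account. ",
         "GPS data must be requested specifically on the DHS website.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# File names taken from this thread: one geographic file, one not.
print(is_geographic_file(c("AOGE52FL.zip", "EGIR4ASV.rds")))
```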

@OJWatson I guess "author" is appropriate so we'll add you as that. If for some reason you'd rather just be a "contributor" feel free to say. Thanks again!

timcdlucas commented 5 years ago

This is all added and documented. Heading to CRAN.

I couldn't work out how to get testing to work but I'll open a separate issue for that and probably won't get around to fixing it for a while.

camillebelmin commented 3 years ago

Hi, although this issue is closed, I would like to come back to it. I got the same error as @timcdlucas, although apparently not for the same reasons. When I call:

get_datasets("EGIR4ASV.rds")

I get the following error:

Logging into DHS website...
Error in names(filedatatypelist_DHS) <- paste0("filedatatypelist_", qdapRegex::rm_between(filedatatypelist_DHS_line,  : 
  'names' attribute [1] must be the same length as the vector [0]

I have read @OJWatson's answer, which is quoted below this message. In my case, I do have access to the file I am requesting, and I can see the Download Manager on the DHS website. I have tried to debug and reached the "y". In my case "y" is a very long string whose first lines look like:

  [1] "<!DOCTYPE html> <html lang=\"en\"> <!-- Content Copyright Macro International
   [2] "<!-- Page generated 2021-04-21 16:19:52 on server 1 by CommonSpot Build 10.6.0.30 (2019-10-04 12:35:29) -->"
   [3] "<!-- JavaScript & DHTML Code Copyright &copy; 1998-2019, PaperThin, Inc. All Rights Reserved. --> <head>"
   [4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />"
   [5] "<meta name=\"Description\" id=\"Description\" content=\"Download Datasets\" />"
   [6] "<meta name=\"Generator\" id=\"Generator\" content=\"CommonSpot Build
   [7] "<title>The DHS Program - Download Datasets</title> <style id=\"cs_antiClickjack\">body{display:none !important;position:absolute !important;top:-5000px !important;}</style><script type=\"text/javascript\">(function(){var chk=0;try{if(self!==top){var ts=top.document.location.href.split('/');var ws=window.document.location.href.split('/');if(ts.length<3||ws.length<3)chk=1;else if(ts[2]!==ws[2])chk=2;else if(ts[0]!==ws[0])chk=3;}}catch(e){chk=4;}if(chk===0){var stb=document.getElementById(\"cs_antiClickjack\");stb.parentNode.removeChild(stb);}else{top.location = self.location}})();</script> <script>"
   [8] "var jsDlgLoader = '/data/dataset_admin/loader.cfm';"

But I am a bit clueless about what to do now. @OJWatson, does that help you understand what is going on? Do you need the whole string?

Many thanks

Answer from @OJWatson on Feb 15, 2019:

Hmm okay, so it seems to be erroring at the stage where rdhs goes to the Download Manager tab. A couple of things to try:

1. With the login account that you have could you try logging in to the DHS website and then click on the Download Manager tab. This should take you to a page that looks something like this. Do you get this page?:
   ![image](https://user-images.githubusercontent.com/15249565/52851968-4990af00-310f-11e9-9edc-768780e92a25.png).

2. If yes then you may need to give me a bit more information. Before running `get_datasets(dats)` could you debug the following `debug(rdhs:::available_datasets)`. Then as you step through you'll reach the following lines:
  # Grab the content from that and start creation for last post request
  writeBin(z$content, tf)
  # load the text
  y <- readLines(tf, warn = FALSE)

Could you dump and upload what y looks like here. This should be the Download Manager web page, from which I grab all the selectable download options before making another POST request to create the url with all the download links available for your account. In grabbing the selectable options the error is thrown due to not finding any selectable options. So if you can see them in step 1, then this should let me know what's going on.

Thanks again for trying it out and trying to get this to work,

All the best,

OJ