mcSamuelDataSci / CACommunityBurden


seeking efficient approach to group_by age adjustment work #50

Closed: mcSamuelDataSci closed this issue 5 years ago

mcSamuelDataSci commented 5 years ago

Zev (and Nate)-

The first coding approach below works just fine, and the data frame "countyAA" has exactly what I need. That is: for every county, year, sex, and CAUSE combination I get the age-adjusted death rate, along with the associated upper and lower CI and the SE. The only variable I am not grouping by is ageGroup, and that is what drives the whole adjustment.

BUT it is clearly inefficient to run the ageadjust.direct function (part of the "epitools" package; modified by me into ageadjust.direct.SAM to include the SE and to deal with some 0's) four times. I did this a while ago when I really needed to just get it done, but now I am trying to figure out how to do it more efficiently, and have not been successful so far despite a fair bit of effort.

The second approach below partially works, but it puts all of the results into a single column, as one character-string representation of the result vector; see the picture below (the rows with "NAs" have 0 deaths and are filtered out elsewhere). I could write something to parse out that string, but I bet there is a simpler, better way?

Any suggestions would be most appreciated.


countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(
            aRate   = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000
       ) 

countyAAtry2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 

[screenshot: the aMulti column of countyAAtry2, holding the full result vector for each row]

nteetor commented 5 years ago

You could try the following:

ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct.SAM(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )

    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }
mcSamuelDataSci commented 5 years ago

Thanks for your quick reply. It runs and gives the right number of rows, but it does not generate the proper results; see below. I don't know how the "."s work in your code, but I thought maybe a %>% was missing before the summarize; adding one didn't help... I'd be very happy to do a WebEx any time if that would help diagnose/solve this; all of the code is working and is on the GitHub site. Thanks!!!

[screenshot: output that does not contain the proper results]

zross commented 5 years ago

@mcSamuelDataSci Any chance you can create a reproducible/small example we can test with?

mcSamuelDataSci commented 5 years ago

Here is a reproducible example (I exported the input file and made a couple of other little edits):

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")

# THIS WORKS - INEFFICIENT
countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000 
            ) 

# MY ATTEMPTS... CLOSE BUT NO CIGAR
countyAA_try2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 

# CLOSER, STILL NO CIGAR
countyAA_try3 <- ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )

    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }
zross commented 5 years ago

@nteetor could you please take a look?

nteetor commented 5 years ago

I am confused about why values 2 through 5 are pulled from the result of ageadjust.direct() (see the original try). It looks like the return value is a vector of 4 values.

mcSamuelDataSci commented 5 years ago

I am not clear on what you are saying. What I need is something like shown below.

[screenshot: the desired output, one row per group with the adjusted rate and CI columns]

nteetor commented 5 years ago

Yes, and I believe aRate is <result>[1], not <result>[2]. I do not use the epitools package, so I apologize if I am missing something. If the indexing is off and gets fixed, does that help with any of the problems outlined above?

mcSamuelDataSci commented 5 years ago

See the shortened example below. I don't see the indexing being off: aRate is <result>[2] in all cases (<result>[1] is the "crude rate", which is not used). This code should run fine and, I believe, shows the exact issues. Thanks.

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")

ageSmall <- filter(ageCounty,county=="Alameda",year==2017,sex=="Total", CAUSE %in% c("A","B","C"))

# This works just fine but is inefficient
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000) 
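
For reference, here is what the raw return value looks like for a single group, which is where the [2]/[3]/[4] indexing comes from (if I have the names right, ageadjust.direct.SAM, my modified version, adds the SE as a fifth element):

causeA <- filter(ageSmall, CAUSE == "A")
ageadjust.direct(count = causeA$Ndeaths, pop = causeA$pop, rate = NULL,
                 stdpop = causeA$US2000POP, conf.level = 0.95)
# returns a named numeric vector: crude.rate, adj.rate, lci, uci
# ([1] crude.rate is unused; [2] adj.rate, [3] lci, [4] uci are what I keep)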

# This "works" but generates a vector rather than three separate columns; I tried "unlisting" and could not get it to work
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
    round(ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)))
  ) 

# Your code gets close, but the values for the output variables are the same for all rows, and I am not sure what they are
ageSmall %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
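    # (it looks like `.` inside this { } block is the whole grouped data frame,
    #  so ageadjust.direct() above runs once over all rows combined rather than
    #  once per group, and summarize() then repeats those values for every group)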

    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000
    ) 
  }
zross commented 5 years ago

I'm traveling, but I think I'll have a chance to look at this this evening.

zross commented 5 years ago

I thought I had a good solution, but it actually runs slower than yours despite only running the function once; perhaps you can test it on a bigger dataset?

This is a common issue, and I see it discussed in the references below. My results match yours but take several milliseconds longer. See what I have below.

@nteetor can correct me, but I also think that do() is discouraged these days; I'm not sure.

https://stackoverflow.com/questions/38223003/efficient-assignment-of-a-function-with-multiple-outputs-in-dplyr-mutate-or-summ

https://github.com/tidyverse/dplyr/issues/2326

https://github.com/romainfrancois/tie

Results running yours:

[screenshot: results from the original approach]

Results running mine:

[screenshot: results from the do()/unnest approach]

library(purrr)   # map()
library(tidyr)   # unnest(), spread()

tmp <- ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  do(vals =   ageadjust.direct(count = .$Ndeaths, 
                               pop = .$pop, 
                               rate = NULL, 
                               stdpop = .$US2000POP, 
                               conf.level = 0.95))

mynames <- map(tmp$vals, names) %>% 
  unlist()

tmp %>% unnest() %>% 
  mutate(names = mynames,
         vals = round(100000*vals, 2)) %>% 
  spread(names, vals) %>% 
  select(-crude.rate) %>% 
  rename(aRate = adj.rate,
         aLCI = lci,
         aUCI = uci)
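
For what it's worth, here is an untested sketch of the direction those links point, using group_modify() from newer dplyr (0.8+); it calls ageadjust.direct() once per group and binds the one-row results back together (column names are the ones from the examples above):

ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  group_modify(function(d, keys) {
    # d holds the rows for one group; compute the adjusted rate once for it
    sam <- ageadjust.direct(count = d$Ndeaths, pop = d$pop, rate = NULL,
                            stdpop = d$US2000POP, conf.level = 0.95)
    tibble(aRate = sam["adj.rate"] * 100000,
           aLCI  = sam["lci"] * 100000,
           aUCI  = sam["uci"] * 100000)
  }) %>%
  ungroup()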
mcSamuelDataSci commented 5 years ago

I had assumed there would be something close to a "standard" (and therefore efficient) approach to this, and I wanted to do it that "right" way. But since there is not, unless you suggest otherwise, I will stick with what I did, and I can now be confident that it is not ridiculous. One thing I like about my approach is that it is pretty easy to tell what it is doing, and that is a priority for the project too. I will close this unless, again, you suggest otherwise. Thanks.

zross commented 5 years ago

I'm surprised that do() is so slow here. I ran it on the full dataset and my code takes about 3x longer. If speed matters, I did see some data.table solutions in my digging. But if you can pre-compute, then whatever works.
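
For the record, the data.table pattern I saw looks roughly like this (untested here); returning a list in j turns each element of the ageadjust.direct() result into its own column, computed once per group:

library(data.table)

ageDT <- as.data.table(ageCounty)

countyAA_dt <- ageDT[, as.list(ageadjust.direct(count = Ndeaths, pop = pop, rate = NULL,
                                                stdpop = US2000POP,
                                                conf.level = 0.95) * 100000),
                     by = .(county, year, sex, CAUSE)]
# gives columns crude.rate, adj.rate, lci, uci for each county/year/sex/CAUSE group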