Closed mcSamuelDataSci closed 5 years ago
You could try the following,
ageCounty %>%
group_by(county, year, sex, CAUSE) %>% {
sam <- ageadjust.direct.SAM(
count = .$Ndeaths,
pop = .$pop,
rate = NULL,
stdpop = .$US2000POP,
conf.level = 0.95
)
summarize(
.,
aRate = sam[2] * 100000,
aLCI = sam[3] * 100000,
aUCI = sam[4] * 100000,
aSE = sam[5] * 100000
)
}
Thanks for your quick reply. It runs, and gives the right number of lines, but does not generate proper results. See below. I don't know how the "."s work in your code, but thought maybe a %>% was missing before the summarize, but that didn't work... I'd be very happy to do a webex any time if that would help diagnose/solve and/or all the code is working and is on the GitHub site? Thanks!!!
@mcSamuelDataSci Any chance you can create a reproducible/small example we can test with?
Here is a reproducible example (exported the input file, and a couple other little edits):
library(dplyr)
library(epitools)
githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")
# THIS WORKS - INEFFICIENT
countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
summarize(aRate = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
aLCI = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
aUCI = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000,
aSE = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000
)
# MY ATTEMPTS... CLOSE BUT NO CIGAR
countyAA_try2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
summarize(aMulti = list(unique(
round(
ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
)
)
)
# CLOSER, STILL NO CIGAR
countyAA_try3 <- ageCounty %>%
group_by(county, year, sex, CAUSE) %>% {
sam <- ageadjust.direct(
count = .$Ndeaths,
pop = .$pop,
rate = NULL,
stdpop = .$US2000POP,
conf.level = 0.95
)
summarize(
.,
aRate = sam[2] * 100000,
aLCI = sam[3] * 100000,
aUCI = sam[4] * 100000,
aSE = sam[5] * 100000
)
}
@nteetor could you please take a look?
I am confused why values 2 through 5 are pulled from the result of ageadjust.direct()
(see original try). It looks like the return value is a vector of 4 values.
I am not clear on what you are saying. What I need is something like shown below.
Yes and I believe aRate
is <result>[1]
not <result>[2]
. I do not use the epitools
package, so I apologize if I am missing something. If the indexing is off and is fixed, does this help any of the problems outlined above?
See shortened example below. I don't see the indexing being off. aRate is <result>[2]
in all cases (<result>[1]
is the "crude rate", and is not used). This code should run fine, and shows the exact issues I believe. Thanks.
library(dplyr)
library(epitools)
githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")
ageSmall <- filter(ageCounty,county=="Alameda",year==2017,sex=="Total", CAUSE %in% c("A","B","C"))
# This works just fine but is inefficient
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
summarize(aRate = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
aLCI = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
aUCI = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000)
# This "works" but generates a vector rather than three seperate columns; I tried "unlisting" and could not get it
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
summarize(aMulti = list(unique(
round(ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)))
)
# Your code gets close, but the values for the output varaibles are the same for all rows, and I am not sure what they are
ageSmall %>%
group_by(county, year, sex, CAUSE) %>% {
sam <- ageadjust.direct(
count = .$Ndeaths,
pop = .$pop,
rate = NULL,
stdpop = .$US2000POP,
conf.level = 0.95
)
summarize(
.,
aRate = sam[2] * 100000,
aLCI = sam[3] * 100000,
aUCI = sam[4] * 100000
)
}
I'm traveling, but I think I'll have a chance to look at this this evening.
I thought I had a good solution, but it actually runs slower than yours despite only running the function once, perhaps you can test on a bigger dataset?
This is a common issue and I see it discussed in these references. My results match yours but take several milliseconds more. See what I have below.
@nteetor can correct me, but I also think that do()
is discouraged these days but I'm not sure.
https://github.com/tidyverse/dplyr/issues/2326
https://github.com/romainfrancois/tie
Results running yours:
Results running mine:
tmp <- ageSmall %>%
group_by(county, year, sex, CAUSE) %>%
do(vals = ageadjust.direct(count = .$Ndeaths,
pop = .$pop,
rate = NULL,
stdpop = .$US2000POP,
conf.level = 0.95))
mynames <- map(tmp$vals, names) %>%
unlist()
tmp %>% unnest() %>%
mutate(names = mynames,
vals = round(100000*vals, 2)) %>%
spread(names, vals) %>%
select(-crude.rate) %>%
rename(aRate = adj.rate,
aLCI = lci,
aUCI = uci)
I had assumed there would be something close to a "standard" (and therefore efficient) approach to this, and wanted to do it that "right" way. But, since there is not, unless you suggest otherwise, I will stick with what I did, and can now be confident that it is not ridiculous. One thing I like about my approach is that it is pretty easy to tell what it is doing, and this is a priority for the project too. I will close this unless, again, you suggest otherwise. Thanks.
I'm surprised that do()
is so slow here. I ran on the full dataset and see that my code takes 3x longer. If speed matters, I did see some data.table
solutions in my digging. But if you can pre-compute then whatever works.
Zev (and Nate)-
The first coding approach below works just fine, and the dataframe "countyAA" has exactly what I need. That is: for all county, year, sex, and CAUSE combinations I get the age-adjusted death rate, and associated upper and lower CI, and SE. The only thing I am not grouping by is ageGroup, and this is what drives the whole thing.
BUT, it is clearly inefficient to run the adeadjust.direct function (part of the "epitools" package; modified by me to ageadjust.direct.SAM (to include the SE, and deal with some 0's)) four times. I did this a while ago when I really needed to just get this done, but now am trying to figure out how to do it more efficiently, and have not been successful so far despite a fair bit of effort.
The second approach below partially works, but puts the results in one column in one character string representation of a vector of results--see picture below (the ones with "NAs" are filtered out elsewhere, they have 0 deaths). I could do something to parse out this string, but I bet there is a simpler better way?
Any suggestions would be most appreciated?