yrbss README code produces incorrect results.. consider using a different data set?

rbertolusso / intubate

<||> Interfaces to Popular R Functions for Data Science Pipelines, and More

GNU General Public License v2.0

75 stars 5 forks source link

yrbss README code produces incorrect results.. consider using a different data set? #4

Closed ajdamico closed 8 years ago

ajdamico commented 8 years ago

hi, the yrbss is a complex sample survey data set that requires the R survey package if you intend to calculate statistics and standard errors correctly. you can use @gergness srvyr package, but it might be simpler for you to just pick a different example dataset. your numbers will never match the CDC's official estimates because you are ignoring the sampling design and the survey weights. compare the results below--notice that not only the SEs are way off, but the coefficients are also incorrect. @hadley packaged the CDC's tech docs with the yrbss github package, but perhaps that repo also could use a warning that SRS computations are not appropriate. users are going to assume they can cut & paste your code, and arrive at the wrong numbers. see this folder for more detail. there are plenty of example datasets that can be analyzed without complex sample survey adjustments, so i hope you switch to one of them. thanks

# mistaken implementation of a regression from your README, assumes simple random sampling

library(dplyr)     ## Does data transformation
library(magrittr)  ## Implements pipelines
# devtools::install_github("hadley/yrbss")
library(yrbss)
data(survey)

library(intubate)
survey %>%
  filter(!is.na(stheight) & !is.na(stweight) &
           year == 2013 & sex == "Male"
         ) %>%
  select(stheight, stweight) %>%
  ntbt_lm(stweight ~ stheight) %>%  ## Using the interface function
  summary()

# correct implementation of linear model on a complex sample survey design

setwd( "C:/My Directory/YRBSS/" )

library(downloader)
source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Youth%20Risk%20Behavior%20Surveillance%20System/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE )

load( "yrbs2013.rda" )

library(survey)     # load survey package (analyzes complex design surveys)
options( survey.lonely.psu = "adjust" )

y <- 
    svydesign( 
        id = ~psu , 
        strata = ~stratum , 
        data = x , 
        weights = ~weight ,
        nest = TRUE
    )

z <- subset( y , !is.na( q6 ) & !is.na( q7 ) & q2 == 2 )

summary( svyglm( q7 ~ q6 , z ) )

rbertolusso commented 8 years ago

Yes. I am kicking myself in the butt for it, as it fails when I check_as_cran due to the vignette (as the yrbss library does not exists in CRAN...).

I chose it as someone told me it was interesting (at least is big...) so I could show some transformations before piping to the statistical analysis.

Of course

data %>%
  ntbt_lm(y ~ x)

would have done the trick (most of the examples in the manual are like that), but I am not sure if that would have been enough for non-intermediate-to-advanced people to grasp what is going on (also wanted to show some introduction to pipelines).

About your comment on the code

# mistaken implementation of a regression from your README, assumes simple random sampling

know that that was totally on purpose. If you go to the very end of the README (and there is a note inviting you to do that right before performing the analysis...), you will see that I am making a whole point about doing analyses on wrong data. In fact, one of the things that terrifies me most about intubate is that it may make it too easy for anyone to play to be a statistician.

Now a matter of terminology (and this is a confusing point and I heard it a lot both ways). For me a simple random sample means that the sample was random but without replacement, and a random sample means that the sample was random with replacement. In that context, when you sample without replacement you violate the assumption of independence. From my definition, it is not true that the regression assumes simple random sampling, but it assumes random sampling (in other words, errors are iid ~ $N(0, \sigma^2)$). Of course if in the case of simple random sampling n << N the violation of independence does not hurt much and it is considered reasonable (whatever that means) to treat it as a random sample. Do you agree?

Back to the real issue. I am now working on other stuff, related to trying to make intubate even more flexible. Maybe down the line I will use a time series with some covariates to perform a regression analysis. Yes, I know (and that is exactly the idea) that it is the absolutely worst case (autocorrelated errors) for a regression analysis.

I am happy you spotted the issue (I do not know how many did...).

I wouldn't mind (in fact, I would appreciate) if you send me an alternative that uses a dataset provided by R with a couple of transformations that in principle make sense (so I can replace survey because I cannot leave something that doesn' t compile the vignette... if not I could have used your explanation on the weights and the need of using such and such methodology that corrects for this and that and so on...), but for which the regression analysis on the transformed data doesn't make sense (and if you can explain which the violations are, the more the merrier, so I do not have to devote time to that). The clarification will be at the very end, as it is now, and of course you will be credited for helping me making a point on this huge problem.

I hope many people complain so I can congratulate them (as I congratulate you), and refer them to this note, and/or invite them to read the end of the document. Cheers

ajdamico commented 8 years ago

i do not agree that you can use an unweighted regression to analyze weighted data from a cluster sample. just use a different data set, it's not a big edit

rbertolusso commented 8 years ago

Where do I say that it is correct to use an unweighted regression to analyze weighted data from a cluster sample? Did you read my note above?

ajdamico commented 8 years ago

whatever this is Of course if in the case of simple random sampling n << N the violation of independence does not hurt much and it is considered reasonable (whatever that means) to treat it as a random sample. Do you agree? no, i don't agree. yrbss is a cluster sample with non-random selection

rbertolusso commented 8 years ago

I think we do not understand each other and this is a pointless discussion. I will stop it here.

ajdamico commented 8 years ago

the misunderstanding is only one way. if you care to match CDC's analytic techniques, you can start here[1] but i doubt it's worth your time. the easiest way for you to revise the error in your published code is to just replace the yrbss example with a different data set

[1] https://github.com/ajdamico/asdfree/blob/master/Youth%20Risk%20Behavior%20Surveillance%20System/replicate%20cdc%20software%20for%20analysis%20of%20yrbs%20data%20publication.R

On Sun, Jul 31, 2016 at 3:25 AM, rbertolusso notifications@github.com wrote:

I think we do not understand each other and this is a pointless discussion. I will stop it here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/rbertolusso/intubate/issues/4#issuecomment-236415820, or mute the thread https://github.com/notifications/unsubscribe-auth/AANO5xh3zfBACGg9o86GtO5Or0YsRQFyks5qbE3egaJpZM4JY9qE .

rbertolusso commented 8 years ago

My only intention was to use a wrong dataset that should not be used in a regression analysis, to illustrate the perils of doing statistics mindlessly. If it not were for the fact that it doesn't build, I would have kept it (wrong as it is because that was the whole point), and would have added your explanation on why it is wrong together with the discussion that is already there at the end of the document. I even asked you to provide an alternative, because I will not address this now.

You only keep telling me: replace it because is wrong.

I know it is wrong. I purposely did something that is incorrect to prove a point.

You tell me to put something that is correct so we can protect people from thinking if they decide to just copy and paste. I do not want to protect anybody from error.

When I will change the dataset (not now because I am doing other things and I am wasting too much energy on you), the new one will still be wrong for whichever statistical analysis I will decide to use.

I understand you do not share my style. So be it.

ajdamico commented 8 years ago

i'm not the only one who's gonna miss that curveball

On Sunday, July 31, 2016, rbertolusso notifications@github.com wrote:

My only intention was to use a wrong dataset that should not be used in a regression analysis to illustrate the perils of doing statistics mindlessly. If it not were for the fact that it doesn't build, I would have keep it (wrong as it is because that was the whole point), and would have added your explanation on why it is wrong together with the discussion that is already there at the end of the document. I even asked you to provide an alternative, because I will not address this now.

You only keep telling me: replace it because is wrong.

I know it is wrong. I purposely did something that is incorrect to prove a point.

You tell me to put something that is correct so people so we can protect them from thinking if they copy and paste. I do not want to protect anybody from error.

When I will change the dataset (not now because I am doing other things and I am wasting too much energy on you), it will still be wrong for the corresponding statistical analysis.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/rbertolusso/intubate/issues/4#issuecomment-236417591, or mute the thread https://github.com/notifications/unsubscribe-auth/AANO5wv16dNz_m7uTuSrq8Tr8j9fkaUAks5qbFp0gaJpZM4JY9qE .

rbertolusso commented 8 years ago

Perhaps I need to work better in the trap so the ones like you, who wouldn't do mindless statistics and wouldn't just copy and paste, don't get angry with me thinking I am completely out of my mind doing wrong statistics and promoting others to do so.

We are supposed to be skeptical. We will only learn the hard way.

I hope now we understand each other.

(End of fighting. Please!)

About your original post, I want to thank you for:

Letting me know that there is a package, survey, that also has functions that need interfacing, such as svydesign. I will add them to the ones provided so far. If you know of more packages missing interfaces, in the ball park of "data science", please let me know.
Letting me know why the example was incorrect. I just put it there without checking at anything (the same most people do), knowing it was wrong, but not knowing why it was wrong. You gave me a lot of info. In fact, I will use it in class and after setting the trap there (I hope with some better skills), I will show, using your code, after carefully reviewing it (I am also supposed to be skeptical), how it should have been analyzed.

If you want to help me in improving the trap and help in finding another example in your area that is clearly wrong (for example one that an instructor may have used with you in the past in the area of surveys to teach you a lesson), but that compiles the vignette!, and that everybody in your area will right away find out I cannot be doing that (so it must be on purpose) I would greatly appreciate it.

We would be doing a service to a lot of people.

ajdamico commented 8 years ago

they've already been implemented in the srvyr package

On Sunday, July 31, 2016, rbertolusso notifications@github.com wrote:

Perhaps I need to work better in the trap so the ones like you, who wouldn't do mindless statistics and wouldn't just copy and paste, don't get angry with me thinking I am completely out of my mind doing wrong statistics and promoting others to do so.

We are supposed to be skeptical. We will only learn the hard way.

I hope now now understand each other.

(End of fighting. Please!)

About your original post, I want to thank you for:

-

Letting me know that there is a package, survey, that also has functions that need interfacing, such as svydesign. I will add them to the ones provided so far. If you know of more packages missing interfaces, in the ball park of "data science", please let me know.

Letting me know why the example was incorrect. I just put it there without checking at anything (the same most people do), knowing it was wrong, but not knowing why it was wrong. You gave me a lot of info. In fact, I will use it in class and after setting the trap there (I hope with some better skills), I will show, using your code, after carefully reviewing it (I am also supposed to be skeptical), how it should have been analyzed.

If you want to help me in improving the trap and help in finding another example in your area that is clearly wrong (for example one that an instructor may have used with you in the past in the area of surveys to teach you a lesson) but that compiles!, and that everybody in your area will right away find out I cannot be doing that (so it must be on purpose) I would greatly appreciate it.

We would be doing a service to a lot of people.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/rbertolusso/intubate/issues/4#issuecomment-236429982, or mute the thread https://github.com/notifications/unsubscribe-auth/AANO50OFLkbqGTxjuXGO6sBHRvTd5M26ks5qbKOggaJpZM4JY9qE .

rbertolusso commented 8 years ago

Could you please make a working example of it? (first what is bad - this will go on top without warnings. Then with what is wrong - that will go at the bottom with explanations). This way I do not have to mind about it, can concentrate in improving intubate, and can (freely) use the services of an expert?

gergness commented 8 years ago

I'm, not entirely sure I want to enter this fray, but here's a comparison of what survey weighted analysis looks like vs leaving off the survey weight variables.

library(survey)
library(srvyr)
library(dplyr)

data(api)

# Analysis using dplyr / base functions without respect to the survey design
apiclus2 %>% 
  summarize(api00_unwt = mean(api00), # mean for sample population, but is not the true population
           api00_wt = weighted.mean(api00, pw)) 
# The weighted mean reflects the survey weights (but you couldn't calculate the 
# standard errors correctly without the survey package) 

apiclus2 %>% 
  group_by(stype) %>% 
  summarize(api00_unwt = mean(api00), 
            api00_wt = weighted.mean(api00, pw)) # mean by school type

summary(glm(api00 ~ ell + meals + mobility, data = apiclus2)) 
# I use glm to match survey package, but I think I could use lm() too.

# Analysis using the survey package
dclus2 <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2) # Set up survey structure

svymean(~api00, dclus2) # mean for whole population (like summarize)
svyby(~api00, ~stype, dclus2, svymean) # mean by stype (like group_by + summarize)

summary(svyglm(api00 ~ ell + meals + mobility, design = dclus2)) # A linear regression

# Analysis using the srvyr package
dclus2 <- apiclus2 %>% 
  as_survey(id = c(dnum, snum), fpc = c(fpc1, fpc2)) 
# survey setup (interface changed to be 'dplyr-like', which made bigger changes
# than is typical in the intubate package from what I can tell) 

dclus2 %>% 
  summarize(api00 = survey_mean(api00)) # Mean of whole population

dclus2 %>% 
  group_by(stype) %>% 
  summarize(api00 = survey_mean(api00)) # Mean by stype

summary(glm(api00 ~ ell + meals + mobility, data = apiclus2)) 
# srvyr does not add a pipeable interface for svyglm (but the survey
# command still works). I am not sure whether it should be in
# srvyr or in something like intubate/broom.

The problem with analyzing the data without the survey design, is that a lot of times surveys are designed to over-sample groups. So for example, I worked on a survey in Iowa where we purposefully had a sample with 1/3 Black, 1/3 White, and 1/3 Hispanic respondents. However, the actual population is about 90-95% White. So any analysis without the weights doesn't make a ton of sense because you'd be working with a sample that is totally different from the true population. Therefore, I think that the unweighted means and unweighted regressions from the first section are "wrong" not just "bad" and so shouldn't be included in your example.

FWIW, I see that this discussion started about lm() vs survey.lm() and my intention with srvyr was to mimic the dplyr interface and so I never made pipeable functions for the regression models. I expected this to go into something like the broom package, but intubate could play a role as well. I think that an important distinction between intubate and srvyr is that with srvyr, I have changed the what function returns in addition to changing the interface. I can see pros and cons to both tactics, but it does make it a little difficult for me to wrap my head around how the two packages could/should work in tandem.

rbertolusso commented 8 years ago

Dear Greg, thank you so very much. Sorry I did not reply before but I just came back. I will get back to this later when I will be reviewing the README/vignette (that is already outdated...). Of course I will credit you for this. I truly appreciate your time and kindness.Do you think that your package is some sort of alternative to intubate and would you like me to put some sort of link to your github? If so, please send me a sentence with a link included and I will happily put it on the README.

gergness commented 8 years ago

Great, I'm glad I could help.

Feel free to reword, but if you do include examples about survey data in your README, I'd appreciate something like:

"The srvyr package allows for analysis of complex surveys using the pipe-friendly syntax of dplyr."