nooreendabbish / Traffic

JSM 2016 GSS Data Challenge

Slide presentation due 6/9/16 #13

Open PatrickCoyle opened 8 years ago

PatrickCoyle commented 8 years ago

Per our seminar yesterday, we should have a 15-minute slide presentation to show in seminar next week. With a consistently updated git directory, we could collaborate to create this using an R Markdown file with one of the presentation formats in RStudio (ioslides, Slidy, or Beamer). Despite the learning curve, I think this would be the best option, since it allows us to create the presentation and run the R code needed for plot output all in one file, instead of creating the output, exporting from RStudio, importing to an Overleaf project and then including the graphic within the LaTeX script. The tradeoff is that we cannot see one another's updates in real time. But I think that, since there are only 3 of us, that tradeoff is worthwhile (as long as we push/pull/commit consistently and appropriately!). This will have the added benefit of building knitting/weaving and Git skills, which we will need for dissertations and research throughout our careers; we probably won't want to write that stuff on Overleaf!

Another option is to create a knitr/Sweave document (.Rnw) instead of an R Markdown document (.Rmd). That would allow us to use the knowledge of LaTeX syntax that we all already have. R Markdown is much simpler than LaTeX but also much less capable/customizable. But I don't think we need any particularly clever tricks to make our presentation. R Markdown should do the trick, and that is what I suggest we use.

Don't panic about learning R Markdown; it appears to be fairly easy. Check out the cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf Cheat sheet v2: https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf Reference guide: https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

Contents of presentation

  1. Introduce the topic of drowsy driving, including some facts from the AAA reference (see that Issues thread for the reference).
  2. Explain our variables of interest for model-building to predict drowsiness. Explain that we will predict a recent year based on a past year (or, ideally, multiple past years).
  3. Present the forward stepwise model building algorithm and the rationale for comparing candidate models by AIC (parsimonious models do better in prediction).
  4. Explain ordinal risk group technique -- we are using a set of m dichotomous categorical variables to create a discrete ordinal variable (risk) with 2^m unique values. Explain the use of an ROC curve (predicting 2014 drowsiness from best 2013 model) in determining the "best" cutpoint among these 2^m values. Explain that the area under the ROC curve can be used as an overall score of the predictive power of the model/rule. Explain the meaning of the jumps in the ROC curve.
  5. Explain the difference between weighted and unweighted ROC analysis. Pose the crucial question: when do the weights matter in analysis? Hypothesize that they matter when exploring variables by which the experiment stratified (like trucking and fatality). Try to support this with evidence by exploring trucking and fatality.
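
A minimal sketch of items 4 and 5, on simulated stand-in data rather than the GES files. Everything named here (`X`, `risk`, `w`, `roc_points`, `auc`) is hypothetical, and two choices are assumptions not stated in this thread: the m indicators are combined into one ordinal score by summing powers of two, and the "best" cutpoint is taken to be the one maximizing Youden's J (TPR minus FPR). Each jump in the resulting curve corresponds to one more risk group being flagged as the cutpoint is lowered.

```r
# Simulated stand-in data; none of these values come from the GES files.
set.seed(1)
n <- 500
m <- 3                                      # number of dichotomous risk factors
X <- matrix(rbinom(n * m, 1, 0.4), n, m)    # hypothetical 0/1 indicators

# Assumption: columns are ordered from least to most important, so the
# most important factor gets the largest power of two. This yields one
# ordinal score with 2^m possible values (0 .. 2^m - 1).
risk <- as.vector(X %*% 2^(0:(m - 1)))

y <- rbinom(n, 1, plogis(-2 + 0.5 * risk))  # hypothetical outcome, 1 = drowsy
w <- runif(n, 0.5, 5)                       # hypothetical sampling weights

# ROC points for the rule "flag as drowsy if score >= cutpoint".
# With w left at its default of all ones this is the unweighted ROC.
roc_points <- function(score, y, w = rep(1, length(y))) {
  cuts <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(cuts, function(k) sum(w[score >= k & y == 1]) / sum(w[y == 1]))
  fpr <- sapply(cuts, function(k) sum(w[score >= k & y == 0]) / sum(w[y == 0]))
  data.frame(cut = c(Inf, cuts), fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the piecewise-linear ROC curve (trapezoidal rule).
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc_unweighted <- roc_points(risk, y)
roc_weighted   <- roc_points(risk, y, w)
c(unweighted = auc(roc_unweighted), weighted = auc(roc_weighted))

# Assumption: "best" cutpoint = the one maximizing Youden's J.
best <- roc_unweighted[which.max(roc_unweighted$tpr - roc_unweighted$fpr), ]
best
```

Because the simulated weights here are unrelated to the outcome and the predictors, the two AUC values should come out close. Making `w` depend on a predictor would be one way to explore when the weights start to matter.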

NOTE: Male shows up as our most significant risk group, but I hypothesize that truckers are highly likely to be drowsy due to demanding work schedules, and trucking is also among the most disproportionately male jobs in the country. That link between sex and trucking should partially explain why we're finding maleness to be so important. I believe we can show truckers are mostly male using ACS data (or someone else's report based on ACS data). Check out this graphic from Bloomberg: http://www.bloomberg.com/graphics/2016-who-marries-whom/ Truckers are in the top left -- the "most male" column.

chenchen715 commented 8 years ago

Patrick -

These are very well-organized and well-thought-out plans and points; thank you for putting them together!! They very much helped me understand and catch up with your work. I will spend some time taking a closer look at Lucas's code tomorrow, as well as the code you have written.

At this stage, I think it would be best if you could give me some hints on how I can be of most help. Besides catching up with the work you two have led, maybe I can pick one of the topics among these 10 issues and dig into it more? Do you have a priority ranking in mind for them?

Best, Chen

PatrickCoyle commented 8 years ago

At this point I think these are the three most important items (in order from most important to least important):

  1. Adding an indicator for whether someone is a trucker to the model. I hypothesize that sex and trucking are closely linked (truckers are most often male), and trucking will be a significant main effect to predict drowsiness (due to a demanding schedule), thereby making the sex main effect less significant in a model that includes trucking. I also hypothesize that adding trucking might make the weights more important in the ROC analysis, because truck presence is a stratifying variable in the weighted sampling.
  2. Boosting the prediction by using some missing-data algorithm.

PatrickCoyle commented 8 years ago

Sorry: In between 1 and 2 would be commenting on the statistical validity of using discrete categories instead of a continuous measurement for ROC analysis (check out that thread).

nooreendabbish commented 8 years ago

Hey guys, This looks great. Let me know how I can help out! Also, are we meeting either for seminar or otherwise this week?

PatrickCoyle commented 8 years ago

Please build on the following R Markdown file:

Output/GES6.9/presentation_draft1.Rmd

You can open this in RStudio and, when you are ready to knit (i.e. create the PDF), select "Knit to PDF (Beamer)" from the "Knit PDF" dropdown menu.
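
For anyone who prefers the console to the menu, here is a minimal sketch of the same knit step. `knit_slides` is a hypothetical helper name. `rmarkdown::render()` and the `"beamer_presentation"` format come from the rmarkdown package, and a working LaTeX installation is assumed for the PDF step.

```r
# Hypothetical helper: knit the draft to Beamer slides from the console.
# Assumes the working directory is the repository root.
knit_slides <- function(rmd = "Output/GES6.9/presentation_draft1.Rmd") {
  if (!file.exists(rmd)) {
    stop("Rmd file not found: ", rmd)
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required: install.packages('rmarkdown')")
  }
  # Beamer output is a PDF, so LaTeX must be installed for this to succeed.
  rmarkdown::render(rmd, output_format = "beamer_presentation")
}
```

Calling `knit_slides()` with no arguments would then knit the draft above. Any LaTeX error message printed by the render step is the thing to copy into this thread if knitting fails.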

I think this division makes sense:

Nooreen: Introduce the topic of drowsy driving, past findings and understanding of the topic, variables of interest, importance to public health, etc. Mention some AAA stats (see the AAA thread for a reference).

Patrick: Introduce and explain our method: merge data, recode, forward-stepwise model building, ordinal risk group labeling, prediction and ROC analysis based on risk groups.

Chen: Get started on modeling trucking in addition to the rest of the predictors already listed in Patrick/predict2014.with.best.2013.model.R. Try to take our subset of predictors and apply Lucas's algorithm or EM algorithm. If you cannot finish this by Thursday, just include what was done in the slides, including introducing the missing data problem and the fact that some missing data is imputed but others are not.

Let me know what you think.

Patrick

PatrickCoyle commented 8 years ago

Hey, so I coded for heavy trucks and it doesn't look like it is a very significant predictor for drowsiness, although the interaction between sex and heavy trucking is borderline significant. What is interesting is that, if we include heavy trucks in our forward-stepwise process, then the script does not select sex OR trucking as a predictor. I think we can use this to emphasize that forward stepwise is tricky and perhaps a bad (?) choice in the presence of significant interactions....
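
One way this can happen, sketched on simulated data with base R's `glm()` and `step()` rather than the project's own survey-weighted script (which is not reproduced here). The data frame `d`, its columns, and all effect sizes are made up for illustration. The outcome depends on sex and trucking only through their interaction, so neither main effect has much marginal signal, and `step()` will not consider `sex:truck` until both main effects are already in the model.

```r
# Simulated stand-in data; effect sizes are arbitrary, not GES estimates.
set.seed(2016)
n <- 2000
d <- data.frame(
  sex   = rbinom(n, 1, 0.5),
  truck = rbinom(n, 1, 0.5),
  night = rbinom(n, 1, 0.3)
)
# Pure interaction: the marginal effects of sex and truck cancel out.
eta <- -1 + 1.5 * (2 * d$sex - 1) * (2 * d$truck - 1) + 0.8 * d$night
d$drowsy <- rbinom(n, 1, plogis(eta))

null_fit <- glm(drowsy ~ 1, family = binomial, data = d)

# Forward selection by AIC over main effects only.
fwd_main <- step(null_fit, scope = ~ sex + truck + night,
                 direction = "forward", trace = 0)

# Forward selection with the interaction allowed. step() respects
# marginality, so sex:truck is only a candidate once sex and truck are in.
fwd_int <- step(null_fit, scope = ~ sex * truck + night,
                direction = "forward", trace = 0)

# The model that actually generated the data, fitted directly.
full_fit <- glm(drowsy ~ sex * truck + night, family = binomial, data = d)

formula(fwd_main)
formula(fwd_int)
AIC(null_fit, fwd_main, fwd_int, full_fit)
```

Comparing the selected formulas with `full_fit` shows whether the forward search found the interaction. If it did not, that is the stepwise procedure's blind spot rather than evidence that sex and trucking are unimportant.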

I will write up my results in more detail later. Check out the v2's of the scripts if you are curious about how trucking is included (and other changes I made....). I think it would be just as valid to look at drivers from all day and night...but we can test for this. If the behavior of the predictors is drastically different in nighttime vs. daytime, then we should model the two time periods separately! Otherwise, we should model them together. There is some type of stratification test to look into this..... Patrick.

chenchen715 commented 8 years ago

Patrick -

The task division sounds good to me. I was trying to understand your code for modeling heavy trucks last night. Maybe we can check the significance of interaction terms between time of day and the other predictors, or we can do some sort of CMH test on it, but I am not sure if we can do this type of test on survey data. I will try to find out tonight.

For the missing-data imputation, I am afraid I won't be able to get to it by this Thursday, but I will try to write up something in the slides.

Have a great day, Chen

nooreendabbish commented 8 years ago

Hi guys,

I'll probably work on the background/stats tomorrow night. Since my portion doesn't involve using any coding and I don't have knitting/polymode working, I will just edit the markdown and not knit anything.

Thanks, Nooreen

chenchen715 commented 8 years ago

Hey guys:

I was trying to generate a .pdf file using R Markdown, and it didn't work out for me. It seems like I am missing some tex file (I couldn't really understand the error message); I may need to take some time to sit down and understand it.

I have put the text in the file Patrick put up, "presentation_draft1.Rmd" at location, ~\DrowsyDrivers\Output\GES6.9. If you get a chance, could you "knit the file" for me? Otherwise, I will put what I wrote in the overleaf running note tomorrow morning.

Thank you, Chen

PatrickCoyle commented 8 years ago

Ok, if you provide an error report and screenshot of the option you chose, I can probably help.

I included your text, it's in v3 in the folder (accidentally overwrote v2. sorry. growing pains.)

PatrickCoyle commented 8 years ago

http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf (lower right-hand corner of page 2)

nooreendabbish commented 8 years ago

I added information and will fix the references during the two hours I have to kill. Will try to knit too, since there will be plenty of Studio users around.... If there's more to do during that time, let me know.

I just saw that the slides didn't come out in the pdf (thank you Patrick or Chen)!

I am giving in and installing Studio :) I just realized you can edit comments and add emojis on Issues and am having fun with that.

PatrickCoyle commented 8 years ago

Sorry for the version issues. I've been working on the v3 suffixed by just my name. I will combine tomorrow.

Patrick

EDIT: Also Nooreen I missed your question from earlier! We are supposed to meet tomorrow at the regular time. Can you make it?

chenchen715 commented 8 years ago

To follow up on Patrick's comment about the time-of-day, trucking, and gender variables: I was wondering how you came to think that we should model the time periods separately if predictors behave differently when subsetting by time of day (Hour=7:23 and Hour=0:6 according to the subset you set in predict2014.with.best.2013.model.AIC_v2.R).

Would it serve the purpose if we include time of day as a predictor and understand the effect it has on the other predictors by testing the interaction term?

chenchen715 commented 8 years ago

By running the following model:

test <- svyglm(DROWSY ~ HEAVY_TRUCK * SEX_IM * ifelse(HOUR_IM %in% 7:23, 1, 0), family = quasibinomial(link = logit), design = GES2013.design, subset = PER_TYP == 1 )

I found that time of day is significant as expected, and the interaction terms time of day * gender, sex * heavy_truck, and the three-way interaction time of day * sex * heavy_truck are significant. I am still trying to understand how to interpret a three-way interaction; please let me know if you have any thoughts.

And here is something weird. I started wondering whether the significance test for one specific predictor takes the rest of the predictors into account. Then I figured a type III ANOVA might make a stronger case, so I used the drop1() function to obtain something similar to a type III ANOVA. It generates conditional deviances by comparing model 1 (with one variable removed) to model 2 (the full model fed into drop1(); in our case, the glm object test).

And the "conditional tests" give different results: time of day is significant, and time of day * sex is significant. Is this a sign of collinearity among our predictors??

So now the question is: which of the tests should we rely on when doing feature selection? Would the "conditional tests" help us build a more stable model?
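
A small sketch that may help compare the two kinds of test, using simulated data and plain `glm()` rather than `svyglm()` and the GES design. All variable names and effect sizes are made up. In base R, the coefficient table from `summary()` gives Wald tests that are already adjusted for the other terms in the model, while `drop1(..., test = "Chisq")` gives likelihood-ratio tests and only offers terms whose removal respects marginality (a main effect that is part of an interaction is not dropped on its own).

```r
# Simulated stand-in data; truck is deliberately correlated with sex.
set.seed(42)
n <- 1500
d <- data.frame(sex = rbinom(n, 1, 0.5), day = rbinom(n, 1, 0.7))
d$truck <- rbinom(n, 1, ifelse(d$sex == 1, 0.3, 0.05))
eta <- -1.5 + 0.4 * d$sex + 0.5 * d$truck - 0.8 * d$day + 0.6 * d$sex * d$day
d$drowsy <- rbinom(n, 1, plogis(eta))

fit <- glm(drowsy ~ truck + sex * day, family = binomial, data = d)

# Wald z-tests, one per coefficient, each conditional on the other terms.
# With sex:day in the model, the "sex" row is the sex effect when day == 0.
wald <- summary(fit)$coefficients
wald

# Likelihood-ratio tests for each term that can be dropped on its own.
lrt <- drop1(fit, test = "Chisq")
lrt

# The same test for the interaction, written as an explicit nested comparison.
fit_no_int <- update(fit, . ~ . - sex:day)
nested <- anova(fit_no_int, fit, test = "Chisq")
nested
```

Wald and likelihood-ratio p-values for the same term usually agree roughly but not exactly, so some disagreement is not by itself a sign of collinearity. For `svyglm` fits, the survey package's `regTermTest()` may be the design-based way to test a whole term; that is not shown here.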