Open PatrickCoyle opened 8 years ago
Patrick -
These are very well-organized and well-thought-out plans and points; thank you for putting them together! They helped me a lot in catching up with your work. I will spend some time tomorrow taking a closer look at Lucas's code, as well as the code you have written.
At this stage, I think it would be best if you could give me some hints on how I can be most helpful. Besides catching up with the work you two have led, maybe I can pick one of the topics among these 10 issues and dig into it? Do you have a priority ranking in mind for them?
Best, Chen
At this point I think these are the three most important items (in order from most important to least important):
Sorry: In between 1 and 2 would be commenting on the statistical validity of using discrete categories instead of a continuous measurement for ROC analysis (check out that thread).
Hey guys, This looks great. Let me know how I can help out! Also, are we meeting either for seminar or otherwise this week?
Please build on the following R Markdown file:
Output/GES6.9/presentation_draft1.Rmd
You can open this in RStudio and, when you are ready to knit (i.e. create the PDF), select "Knit to PDF (Beamer)" from the "Knit PDF" dropdown menu.
I think this division makes sense:
Nooreen: Introduce the topic of drowsy driving, past findings and understanding of the topic, variables of interest, importance to public health, etc. Mention some AAA stats (see the AAA thread for a reference).
Patrick: Introduce and explain our method: merge data, recode, forward-stepwise model building, ordinal risk group labeling, prediction and ROC analysis based on risk groups.
Chen: Get started on modeling trucking in addition to the rest of the predictors already listed in Patrick/predict2014.with.best.2013.model.R. Try to take our subset of predictors and apply Lucas's algorithm or an EM algorithm. If you cannot finish this by Thursday, just include what was done in the slides, including an introduction to the missing data problem and the fact that some missing values are imputed while others are not.
Let me know what you think.
Patrick
Hey, so I coded for heavy trucks and it doesn't look like it is a very significant predictor of drowsiness, although the interaction between sex and heavy trucking is borderline significant. What is interesting is that, if we include heavy trucks in our forward-stepwise process, then the script does not select sex OR trucking as a predictor. I think we can use this to emphasize that forward stepwise is tricky and perhaps a bad (?) choice in the presence of significant interactions....
I will write up my results in more detail later. Check out the v2's of the scripts if you are curious about how trucking is included (and other changes I made....). I think it would be just as valid to look at drivers from all day and night...but we can test for this. If the behavior of the predictors is drastically different in nighttime vs. daytime, then we should model the two time periods separately! Otherwise, we should model them together. There is some type of stratification test to look into this..... Patrick.
Patrick -
The task division sounds good to me. I was trying to understand your code for modeling heavy trucks last night. Maybe we can check the significance of interaction terms between time of day and the other predictors, or we can do some sort of CMH test on it, but I am not sure whether we can run that type of test on survey data. I will try to find out tonight.
For the missing-data imputation, I am afraid I can't get to it by this Thursday, but I will try to write up something in the slides.
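The two checks mentioned above can be sketched as follows. This is a minimal illustration on simulated data using base R only; the variable names and effect sizes are made up, and it ignores the survey design entirely, so it does not answer whether these tests are valid for GES data. The CMH test asks whether trucking and drowsiness are associated after stratifying by time of day, while the interaction test asks whether that association differs between the strata, which is the question that bears on modeling day and night separately.

```r
# Simulated data only: names mimic the thread, values are arbitrary.
set.seed(3)
n <- 2000
d <- data.frame(
  day   = rbinom(n, 1, 0.6),   # 1 = daytime, 0 = nighttime
  truck = rbinom(n, 1, 0.3)    # 1 = heavy truck
)
d$drowsy <- rbinom(n, 1, plogis(-1 + 0.7 * d$truck - 0.4 * d$day))

# 2 x 2 x 2 table: truck by drowsy, stratified by time of day.
tab <- table(truck = d$truck, drowsy = d$drowsy, day = d$day)

# Cochran-Mantel-Haenszel test of conditional association (unweighted).
cmh <- mantelhaen.test(tab)

# Likelihood-ratio test of the truck-by-day interaction (unweighted).
fit0 <- glm(drowsy ~ truck + day, family = binomial, data = d)
fit1 <- glm(drowsy ~ truck * day, family = binomial, data = d)
int_test <- anova(fit0, fit1, test = "Chisq")

print(cmh)
print(int_test)
```

A small p-value in `int_test` would suggest the trucking effect changes with time of day; a small p-value in `cmh` alone only says there is an association within strata.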
Have a great day, Chen
Hi guys,
I'll probably work on the background/stats tomorrow night. Since my portion doesn't involve using any coding and I don't have knitting/polymode working, I will just edit the markdown and not knit anything.
Thanks, Nooreen
Hey guys:
I was trying to generate a .pdf file using R Markdown, and it didn't work for me. It seems like I am missing some tex file (I couldn't really understand the error message), so I may need to take some time to sit down and figure it out.
I have put my text in the file Patrick put up, "presentation_draft1.Rmd", at ~\DrowsyDrivers\Output\GES6.9. If you get a chance, could you knit the file for me? Otherwise, I will put what I wrote in the Overleaf running note tomorrow morning.
Thank you, Chen
Ok, if you provide an error report and screenshot of the option you chose, I can probably help.
I included your text, it's in v3 in the folder (accidentally overwrote v2. sorry. growing pains.)
I added information and will fix the references during the two hours I have to kill. Will try to knit too, since there will be plenty of Studio users around.... If there's more to do during that time let me know.
I just saw that the slides didn't come out in the pdf (thank you Patrick or Chen)!
I am giving in and installing Studio :) I just realized you can edit comments and add emojis on Issues and am having fun with that.
Sorry for the version issues. I've been working on the v3 suffixed by just my name. I will combine tomorrow.
Patrick
EDIT: Also Nooreen I missed your question from earlier! We are supposed to meet tomorrow at the regular time. Can you make it?
To follow up on Patrick's comment about the time of day, trucking and gender variables: I was wondering how you came to think that we should model the time periods separately if the predictors behave differently when subsetting by time of day (Hour=7:23 and Hour=0:6 according to the subset you set in predict2014.with.best.2013.model.AIC_v2.R).
Would it serve the purpose if we include time of day as a predictor and understand the effect it has on the other predictors by testing the interaction term?
By running the following model:
    test <- svyglm(
      DROWSY ~ HEAVY_TRUCK * SEX_IM * ifelse(HOUR_IM %in% 7:23, 1, 0),
      family = quasibinomial(link = logit),
      design = GES2013.design,
      subset = PER_TYP == 1
    )
I found that time of day is significant, as expected, and that the interaction terms time of day * sex, sex * heavy_truck, and the three-way interaction time of day * sex * heavy_truck are significant. I am still trying to understand how to interpret a three-way interaction; please let me know if you have any thoughts.
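One way to read a three-way interaction, sketched here with simulated data and plain glm() rather than the survey design (all variable names and effect sizes are made up for illustration): with three binary predictors and all interactions included, the three-way coefficient equals the truck-by-sex interaction in the daytime stratum minus the same interaction in the nighttime stratum.

```r
# Simulated data only: names mimic the thread, values are arbitrary.
set.seed(1)
n <- 4000
d <- data.frame(
  truck = rbinom(n, 1, 0.3),
  male  = rbinom(n, 1, 0.5),
  day   = rbinom(n, 1, 0.6)
)
# True log-odds include a three-way term of 0.8 (an arbitrary choice).
eta <- -1 + 0.4 * d$truck + 0.3 * d$male - 0.5 * d$day +
  0.5 * d$truck * d$male + 0.8 * d$truck * d$male * d$day
d$drowsy <- rbinom(n, 1, plogis(eta))

# Unweighted logistic model with all interactions (plain glm, not svyglm).
full <- glm(drowsy ~ truck * male * day, family = binomial, data = d)

# The truck-by-male interaction estimated separately within each period.
night   <- glm(drowsy ~ truck * male, family = binomial,
               data = d[d$day == 0, ])
daytime <- glm(drowsy ~ truck * male, family = binomial,
               data = d[d$day == 1, ])

three_way  <- unname(coef(full)["truck:male:day"])
difference <- unname(coef(daytime)["truck:male"] - coef(night)["truck:male"])

# The two numbers agree up to numerical tolerance.
print(c(three_way = three_way, difference = difference))
```

If the three-way coefficient is clearly non-zero, the sex-by-trucking interaction itself changes between day and night. The same reading of the coefficient should carry over to a svyglm fit, although the standard errors there depend on the design.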
And here is something weird. I started wondering whether the significance test for one specific predictor takes the rest of the predictors into account. Then I figured a type III ANOVA might make a stronger case. I used the drop1() function to obtain something similar to a type III ANOVA: it computes the conditional deviance by comparing model 1 (with one term removed) against model 2 (the full model passed to drop1(), which in our case is the glm object test).
The "conditional tests" give different results: time of day is significant, and time of day * sex is significant. Is this a sign of collinearity among our predictors??
So now the question is which of the tests we should rely on when doing feature selection. Would the "conditional tests" help us build a more stable model?
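A possible reason the two sets of tests disagree, sketched on simulated data with plain glm() (variable names and effect sizes are made up, and the survey design is ignored): by default drop1() only drops terms that are not contained in a higher-order interaction, so it tests fewer terms than the coefficient table, and a main-effect Wald test in a model with interactions refers to the effect at the reference level of the other variables. Disagreement of this kind does not by itself indicate collinearity. For svyglm objects, regTermTest() in the survey package may be the closer analogue, but that is not checked here.

```r
# Simulated data only: names mimic the thread, values are arbitrary.
set.seed(2)
n <- 3000
d <- data.frame(male = rbinom(n, 1, 0.5), day = rbinom(n, 1, 0.6))
# Trucking is made to depend on sex, so the two predictors are correlated.
d$truck  <- rbinom(n, 1, ifelse(d$male == 1, 0.4, 0.05))
d$drowsy <- rbinom(n, 1,
                   plogis(-1 + 0.6 * d$truck + 0.2 * d$male - 0.5 * d$day))

fit <- glm(drowsy ~ truck + male + day + male:day,
           family = binomial, data = d)

# Wald z-tests: one row per coefficient, including male and day.
wald <- coef(summary(fit))

# Likelihood-ratio tests: only terms that can be dropped without
# violating marginality, here truck and male:day.
lrt <- drop1(fit, test = "Chisq")

print(rownames(wald))
print(rownames(lrt))
```

Comparing the two sets of row names shows that `male` and `day` get a Wald test but no drop1() test, because both are contained in `male:day`.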
Per our seminar yesterday, we should have a 15-minute slide presentation to show in seminar next week. With a consistently updated git directory, we could collaborate to create this using an R Markdown file with one of the presentation formats in RStudio (ioslides, Slidy, or Beamer). Despite the learning curve, I think this would be the best option, since it allows us to create the presentation and run the R code needed for plot output all in one file, instead of creating the output, exporting from RStudio, importing to an Overleaf project and then including the graphic within the LaTeX script. The tradeoff is that we cannot see one another's updates in real time. But I think that, since there are only 3 of us, that tradeoff is worthwhile (as long as we push/pull/commit consistently and appropriately!). This will have the added benefit of building knitting/weaving and Git skills, which we will need for dissertations and research throughout our careers; we probably won't want to write that stuff on Overleaf!
Another option is to create a knitr/Sweave document (.Rnw) instead of an R Markdown document (.Rmd). That would allow us to use the knowledge of LaTeX syntax that we all already have. R Markdown is much simpler than LaTeX but also much less capable/customizable. But I don't think we need any particularly clever tricks to make our presentation. R Markdown should do the trick, and that is what I suggest we use.
Don't panic about learning R Markdown; it appears to be sort of easy. Check out this cheat sheet and its v2: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf v2: https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf Reference: https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Contents of presentation
NOTE: Male shows up as our most significant risk group, but I hypothesize that truckers are highly likely to be drowsy due to demanding work schedules, and trucking is also among the most disproportionately "male" jobs in the country. That interaction should partially explain why we're finding maleness to be so important. I believe we can show truckers are mostly male using ACS data (or someone else's report based on ACS data). Check out this graphic from Bloomberg: http://www.bloomberg.com/graphics/2016-who-marries-whom/ Truckers are in the top left -- the "most male" column.