nooreendabbish / Traffic

JSM 2016 GSS Data Challenge

Mosaic matrix #20

Open PatrickCoyle opened 8 years ago

PatrickCoyle commented 8 years ago

Mosaics are analogous to scatterplots and a good alternative when dealing with categorical or discrete data. I've created a prototype for mosaic scatterplot matrices. For some reason, mosaics (from the vcd package) would not display in graphical grids. I tried par(), layout() and grid.arrange() to no avail. So I made them microplots and printed to LaTeX. This is a very handy alternative to R's graphical grids!

Attached is my prototype. I will work on labeling and removing the legend. You can see Programs/R/Patrick/mosaicSplom.R to see how I made and saved them and Output/GES7.7/mosaicSplomPrint2.Rnw to see how I shrink the table to page size and print.

mosaicSplomPrint2.pdf
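A possible workaround for the grid problem, sketched here as an assumption rather than as what the prototype does: `vcd::mosaic()` draws with grid graphics, so base-graphics layout calls such as `par()` and `layout()` do not affect it, whereas base R's `graphics::mosaicplot()` does honor `par(mfrow = ...)`. The helper name `mosaic_matrix` and the toy data below are hypothetical, and the tables are unweighted.

```r
# Sketch: a mosaic "scatterplot matrix" built from base graphics::mosaicplot().
# Swapped-in technique: base graphics instead of vcd::mosaic().
mosaic_matrix <- function(df, vars = names(df)) {
  k <- length(vars)
  old <- par(mfrow = c(k, k), mar = c(1, 1, 2, 1))
  on.exit(par(old))
  n_panels <- 0
  for (row_var in vars) {
    for (col_var in vars) {
      if (row_var == col_var) {
        # Diagonal panel: print the variable name only
        plot.new()
        text(0.5, 0.5, row_var, cex = 1.5)
      } else {
        # Off-diagonal panel: unweighted two-way mosaic
        tab <- table(df[[col_var]], df[[row_var]], dnn = c(col_var, row_var))
        mosaicplot(tab, main = "", color = TRUE)
        n_panels <- n_panels + 1
      }
    }
  }
  invisible(n_panels)
}

# Toy data (hypothetical, not the GES data)
set.seed(1)
toy <- data.frame(
  DROWSY  = factor(sample(0:1, 200, replace = TRUE, prob = c(0.9, 0.1))),
  NIGHT   = factor(sample(0:1, 200, replace = TRUE)),
  THURFRI = factor(sample(0:1, 200, replace = TRUE))
)

pdf(NULL)  # null device; use pdf("mosaic_matrix.pdf") to save a file instead
n <- mosaic_matrix(toy)
dev.off()
```

With three variables this draws a 3 x 3 grid: three label panels on the diagonal and six mosaics off it. Survey weights would have to enter through the tables themselves, which this sketch does not attempt.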

PatrickCoyle commented 8 years ago

Since we have the full distribution of age, it only makes sense to visualize it. See attached. But keep in mind that drowsiness is a rare event, and so, relatively speaking, is a trucking accident. As a result, we estimate only about 1,500 national accidents involving reportedly drowsy truckers, against ~9.25 million total estimated accidents!

          DUMMY
    DROWSY          1
         0 208006.977
         1   1534.769

age_cond_kdes.pdf
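For readers unfamiliar with weighted totals like the ones in the table above, here is a minimal base R sketch of how summing case weights yields estimated national counts. The column name `WEIGHT` and all values are made up for illustration; the thread's own tables presumably come from the survey design object, which also provides standard errors that this sketch does not.

```r
# Toy records (hypothetical): each sampled accident carries a case weight
toy <- data.frame(
  DROWSY = c(0, 0, 0, 1, 0, 1),
  WEIGHT = c(120.5, 80.0, 60.25, 10.5, 200.0, 4.75)
)

# Weighted totals: the sum of weights within each DROWSY level
est <- xtabs(WEIGHT ~ DROWSY, data = toy)

# Estimated share of each level among all weighted accidents
share <- prop.table(est)

print(est)
print(round(share, 3))
```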

PatrickCoyle commented 8 years ago

Here is an interesting graphic and probably a place we should start in our presentation before applying cutoffs for age and hour and model results:

Age vs. time of day for all accidents, conditioned on drowsiness

age_vs_time_cond_drowsy.pdf

PatrickCoyle commented 8 years ago

Another interesting one: I created a "time of week" variable so we can explore the week at a more granular, hour-by-hour level. Here are the histograms when we condition on drowsiness:

time_of_week_hist_cond.pdf

PatrickCoyle commented 8 years ago

Below is a more sensible view: the week is broken into 8-hour periods (21 bins in total), starting and ending at midnight between Saturday evening and Sunday morning.

Nondrowsy accidents (all drivers) peak at midday (8 AM to 4 PM), following a hill shape on each day of the week (a day being midnight to midnight).

But DROWSY accidents generally peak in the morning and decrease throughout the day. What is really interesting is that this pattern holds true for the weekend (first and last day), but DIFFERS on Thursday and Friday, instead having a hill shape on these days (drowsy accidents peak at midday for Thursday and Friday).

time_week_hist_cond_drowsy.pdf

EDIT: The more obvious takeaway is that drowsy accidents peak on the weekend, while nondrowsy accidents have a less extreme peak midweek.
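The 21 bins described in this comment can be sketched as an integer division of an hour-of-week value (0 to 167, with 0 at midnight Sunday morning) by 8. The variable names and labels below are illustrative assumptions, not taken from the plotting code.

```r
# Hour of the week: 0 = midnight Sunday morning, 167 = 11 PM Saturday
time_of_week <- 0:167

# 8-hour bins: 0..20, three per day
bin <- time_of_week %/% 8

# Human-readable labels (label text is made up for this sketch)
day_names    <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
period_names <- c("Midnight-8AM", "8AM-4PM", "4PM-Midnight")
bin_label <- paste(day_names[bin %/% 3 + 1], period_names[bin %% 3 + 1])

# Example: hour 110 is Thursday at 2 PM
print(bin_label[time_of_week == 110])
```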

PatrickCoyle commented 8 years ago

Here is a larger mosaic splom. Note that the coloring shows standard Pearson residuals instead of weighted residuals, so we should probably remove the coloring. But that leaves grey, which is ugly. I would like the tiles to be colored according to a specified gradient, but I can't figure out how to do that with vcd. The graphical functions in {vcd} and {survey} seem woefully "closed"...

There are some kinks to work out with labeling.

mosaicSplomPrint2.pdf

Modeling question that I would really like to answer: is there statistical evidence that the time-of-day distribution of drowsy accidents on Thursday and Friday is different from the rest of the week? This would involve survey-weighted contingency table analysis, so we should consult Lumley's textbook. Really, it is a two-sample hypothesis test, but we may only have the tools to do a single-sample test, in which case we should treat the Saturday-Wednesday data as the hypothesized distribution.

Patrick

chenchen715 commented 8 years ago

Patrick: What you discovered looks interesting!

  1. Could you confirm that we now also want to consider time of the week in our model? Also, how did you derive this variable? Could you add the code/details to the recoding post we have running?
  2. What do you want to get out of "is there statistical evidence that the time-of-day distribution of drowsy accidents on Thursday and Friday is different from the rest of the week"? Can we bring it down to a logistic model: DROWSY = time-of-day + Thursday-Friday? Where time-of-day: if HOURS in 7:23 = 1, else = 0. And Thursday-Friday: if day of week is Thursday or Friday = 1, else = 0. Or do you actually mean something else? If so, could you sketch out the contingency table that you are talking about?
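The two indicators in the proposed logistic model can be sketched as below. This assumes `WKDY_IM` codes Sunday as 1 (so Thursday is 5 and Friday is 6) and `HOUR_IM` runs 0 to 23; the data are simulated, and the fit is a plain unweighted `glm()`, shown only to make the model formula concrete. A survey-weighted fit would go through `survey::svyglm()` on the design object instead.

```r
# Simulated toy data (hypothetical, not the GES data)
set.seed(42)
n <- 500
toy <- data.frame(
  HOUR_IM = sample(0:23, n, replace = TRUE),
  WKDY_IM = sample(1:7, n, replace = TRUE),
  DROWSY  = rbinom(n, 1, 0.2)
)

# time-of-day: 1 if the hour is in 7:23, else 0
toy$TIME_OF_DAY <- as.integer(toy$HOUR_IM %in% 7:23)

# Thursday-Friday: 1 if the day is Thursday (5) or Friday (6), else 0
toy$THURFRI <- as.integer(toy$WKDY_IM %in% c(5, 6))

# Unweighted logistic regression, for illustration only
fit <- glm(DROWSY ~ TIME_OF_DAY + THURFRI, family = binomial, data = toy)
print(coef(fit))
```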

PatrickCoyle commented 8 years ago

  1. I am recoding it as hour zero being midnight on Saturday night/Sunday morning. Here is the code:

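# Add TIME_OF_WEEK to the survey design object (assign the result so it is kept)
GES2013.drivers.design <- GES2013.drivers.design %>% update(TIME_OF_WEEK = (WKDY_IM - 1) * 24 + HOUR_IM)
# Same recode on the plain data frame
GES2013.drivers$TIME_OF_WEEK <- (GES2013.drivers$WKDY_IM - 1) * 24 + GES2013.drivers$HOUR_IM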

I have placed it in recode2013_v2 and saved GES2013.drivers and GES2013.drivers.design with the new info in the data folder.
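As a quick sanity check of the recode, here is the same arithmetic on a few hand-picked values. The function name `time_of_week` is hypothetical; the formula is the one from the comment, and it implies that `WKDY_IM` is 1 for Sunday and `HOUR_IM` runs 0 to 23.

```r
# Hours elapsed since midnight Sunday morning
time_of_week <- function(wkdy, hour) {
  (wkdy - 1) * 24 + hour
}

time_of_week(1, 0)    # Sunday, midnight: start of the week
time_of_week(5, 14)   # Thursday, 2 PM
time_of_week(7, 23)   # Saturday, 11 PM: last hour of the week
```

The three calls give 0, 110 and 167, so the variable takes 168 distinct values over the week.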

Note that I set up the binning on the plot so that the three periods in the day run from midnight to midnight. So the histogram defines the periods as:

- Midnight to 8 AM
- 8 AM to 4 PM
- 4 PM to midnight

We can code with 11 PM to 7 AM when modeling, as we have been, but it might be harder to explain/interpret in terms of days, since it crosses two days.

  2. As far as testing the hypothesis that Thursday and Friday have different distributions from the rest of the week, I think we should do chi-square tests of the contingency table instead of a linear model. It looks like we can use these functions to run such a test:

?svytable

I will try to learn more about it and report back!
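A minimal sketch of what such a test might look like with the {survey} package, using `svytable()` for the weighted table and `svychisq()` for a Rao-Scott adjusted chi-square test. Everything about the data is an assumption: the toy design below has weights only (`ids = ~1`), whereas the real GES design object would carry its own sampling structure, and the variables are simulated.

```r
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)

  # Simulated toy data (hypothetical, not the GES data)
  set.seed(2016)
  n <- 600
  toy <- data.frame(
    DROWSY    = rbinom(n, 1, 0.3),
    DAY_PHASE = factor(sample(1:3, n, replace = TRUE)),
    THURFRI   = factor(rbinom(n, 1, 2 / 7)),
    WEIGHT    = runif(n, 10, 200)
  )

  # Weights-only design; a real design would also specify clusters and strata
  toy_design <- svydesign(ids = ~1, weights = ~WEIGHT, data = toy)

  # Restrict to the drowsy subpopulation via subset() on the design
  drowsy_design <- subset(toy_design, DROWSY == 1)

  # Weighted contingency table and Rao-Scott test of independence
  tab <- svytable(~THURFRI + DAY_PHASE, drowsy_design)
  res <- svychisq(~THURFRI + DAY_PHASE, drowsy_design, statistic = "F")

  print(tab)
  print(res)
}
```

Subsetting the design object, rather than filtering the data frame first, keeps the variance estimation aware of the full sample.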

Patrick T. Coyle PhD Student, Statistics Fox School of Business and Management Temple University patrick.coyle@temple.edu patricktmc@gmail.com (610) 761-1992

chenchen715 commented 8 years ago

Patrick:

To answer your above question, and the one we discussed earlier:

  1. Whether the distribution of TIME_OF_WEEK is different between Drowsy and Non-drowsy;
  2. In the Drowsy subpopulation, whether the distribution of TIME_OF_WEEK is different between Thursday/Friday and the rest of the week.

Here are the findings from Chi-square tests:

  1. p-value <0.001, yes they are different;
  2. p-value = 0.003495 < 0.05, yes they are different.

We can present the findings right after the time_week_hist plot you generated, and confirm the findings that are shown in the plot.

I have the code written in Goodness_of_fit.R at location: ~\DrowsyDrivers\Programs\R\Chen, in case you want to take a look.

Chen

PatrickCoyle commented 8 years ago

Chen,

Because DROWSY is low incidence, giving a cell to each of the 168 unique values of TIME_OF_WEEK leaves some cells empty, and I think the chi-square inference is unreliable with empty cells. Also, I don't think comparing TIME_OF_WEEK to THURFRI makes sense. We should compare a binning of TIME_OF_DAY to THURFRI, I think.

I grouped time of day into three categories (11 PM to 7 AM, 7 AM to 3 PM, and 3 PM to 11 PM, to be consistent with our previous definition of NIGHT and get 3 equal time windows). Then I compared this metric against DROWSY and against THURFRI for the drowsy subset. DAY_PHASE is very significant to DROWSY (marginally AND when I pool nondrowsy and isolate drowsy accidents by day of week). But DAY_PHASE is not significant to THURFRI for drowsy driving accidents.

[image: Inline image 1]

Let me know what you think!

Patrick

PatrickCoyle commented 8 years ago

Here is how I define DAY_PHASE:

[image: Inline image 1]
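The inline image did not come through, so the actual definition is not visible here. One possible recode that matches the three windows described earlier in the thread (11 PM to 7 AM, 7 AM to 3 PM, 3 PM to 11 PM) is sketched below; the function name `day_phase` and the 1/2/3 numbering are assumptions.

```r
# Shift the clock by one hour so that 11 PM becomes hour 0, then cut into
# three 8-hour windows:
#   1 = 11 PM to 7 AM, 2 = 7 AM to 3 PM, 3 = 3 PM to 11 PM
day_phase <- function(hour) {
  ((hour + 1) %% 24) %/% 8 + 1
}

# Check the boundaries of each window
hours <- c(23, 0, 6, 7, 14, 15, 22)
data.frame(hour = hours, phase = day_phase(hours))
```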

chenchen715 commented 8 years ago

Patrick:

Were you trying to include some images? I couldn't see them.

I was wondering if there would be empty cells... but according to your graph "time_week_hist_cond_drowsy.pdf", there don't seem to be any empty cells, which gave me the confidence to run the chi-square test.

And why are we talking about 168 cells? I thought we were doing 21 cells. We split a day into 3 time intervals, and 3*7=21. That was what I used to do the Chisq test.

Here is just a quick thought, and I will take a closer look at what you wrote this evening.

Chen

chenchen715 commented 8 years ago

- 21 cells, when comparing Drowsy vs. Nondrowsy
- 3 cells, when comparing THURFRI vs. the rest of the week, subsetting the Drowsy population.

Essentially I was still doing a test of independence.

chenchen715 commented 8 years ago

I'm sorry, now I have had a close look at what you wrote, I think we were doing the same thing. Well, no, I meant to do the same thing as what you did, but when I looked at my code, I did them without grouping the time.

I looked at the 168-cell case; it is actually not that horribly empty. Out of the 168 cells for DROWSY, about 11 are empty.

And for HOUR_IM vs THURFRI subsetting by DROWSY=1, please take a look at the table, it's not sparse...

EDIT: Sent you in email the table... it really looks horrible here.
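For reference, one way to count empty hour-of-week cells is to tabulate against all 168 possible levels, so that hours with no observations appear as explicit zeros. The helper name `count_empty_cells` and the inputs are hypothetical; the check is on raw observation counts, not weighted totals.

```r
# Number of hour-of-week cells (0..167) with no observations
count_empty_cells <- function(time_of_week) {
  # factor() with explicit levels keeps unobserved hours as zero counts
  tab <- table(factor(time_of_week, levels = 0:167))
  sum(tab == 0)
}

# Toy input: only hours 0, 5 and 167 are observed
count_empty_cells(c(0, 0, 5, 167))

# Every hour observed once: no empty cells
count_empty_cells(0:167)
```

In practice the input would be the TIME_OF_WEEK values of the DROWSY = 1 rows only.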

chenchen715 commented 8 years ago

Oddly, when I group the time of day into

- Midnight to 8 AM
- 8 AM to 4 PM
- 4 PM to midnight

and subset by DROWSY=1, DAY_PHASE vs. THURFRI tests as significant (p-value = 0.03676). The contingency table is:

           DAY_PHASE
    THURFRI         1         2         3
          0 26020.785 15879.135 10008.043
          1  6647.808  8059.934  5017.612

EDIT: Oh gosh, I don't know how to insert the R output correctly... have been trying, but it still doesn't look good!! You have an idea?

EDIT: Can we stick with above grouping of the day? At least we have a story to tell.

chenchen715 commented 8 years ago

And to compare WEK_PHASE between drowsy and non-drowsy (carrying forward your definition of DAY_PHASE), here is the recoding code:

GES2013.drivers$WEK_PHASE <- GES2013.drivers$WKDY_IM*100 + GES2013.drivers$DAY_PHASE

It lines up with what you tested, it's also very significant, pval < 0.001.
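To make the WEK_PHASE coding concrete, this small sketch enumerates every day and phase combination, assuming `WKDY_IM` runs 1 to 7 and `DAY_PHASE` runs 1 to 3. The data frame name `combos` is made up for the example.

```r
# All combinations of day of week (1..7) and day phase (1..3)
combos <- expand.grid(DAY_PHASE = 1:3, WKDY_IM = 1:7)

# Same arithmetic as the recode: hundreds digit = day, units digit = phase
combos$WEK_PHASE <- combos$WKDY_IM * 100 + combos$DAY_PHASE

# 21 distinct codes, from 101 (day 1, phase 1) to 703 (day 7, phase 3)
print(combos$WEK_PHASE)
```

Because the codes are only labels, WEK_PHASE should be treated as a categorical variable (for example via `factor()`) rather than as a number when it is tabulated or modeled.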