joycepyang commented 6 years ago

Hi Lindsey and Savet, I just uploaded new code named "plots". I am making this an issue rather than a pull request per my previous conversation with Lindsey about listing questions as issues rather than pull requests. (If it's better served in pull request form, please let me know and I can adjust for next time.)

I followed these steps from last week:

1. Build a script for CPT clinics and non-CPT participating clinics
2. Produce the mean centered variables ("xc")
3. Produce the distribution graphs of the variable using hist( ) in mean and mean centered form
4. Put the variable mean on the x-axis labeled "mean," put "clinic" on the y-axis

It would be great if you could review the code. I've also knitted it as HTML and uploaded it to the lucid meeting for tomorrow.

Two questions that may or may not be important to answer (depending on which plots we decide are useful to keep):

a) How do we remove the tick / hash marks in the plots that are overpopulated? I tried several different ways to mask them, including changing the breaks, but that did not work.

b) I also tried to print the plots side by side, which worked in RStudio but not in the HTML.

lzim commented 6 years ago

We’ll take a look @joycepyang thanks!

saveth commented 6 years ago

@joycepyang @lzim

I'm not sure what you mean with regards to a. Can you elaborate or point to a specific example?

With regards to b, you can use the function par(mfrow=c(nr, nc)) before your plot commands. nr is the number of rows and nc is the number of columns. If you have par(mfrow=c(2,3)), it'll make a 2x3 grid so you can draw 6 graphs in 2 rows by 3 columns.
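A minimal sketch of that layout, using simulated values (the vectors `cpt` and `cdw` are made up for illustration and are not the project data):

```r
# Simulated values standing in for two clinic-level variables (hypothetical data).
set.seed(1)
cpt <- rnorm(200, mean = 50, sd = 10)
cdw <- rnorm(200, mean = 55, sd = 12)

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns
hist(cpt, main = "CPT", xlab = "Mean")
hist(cdw, main = "CDW", xlab = "Mean")
par(old_par)                      # back to one plot per page
```

In an R Markdown document, knitr resets par() settings for each chunk by default, so the par() call needs to sit in the same chunk as the plots it should affect.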

lzim commented 6 years ago

Hello @joycepyang and @saveth - we need to merge this issue #205 with outstanding issue #76

Specifically, we don't want to lose track of the time-based displays in Next Steps #4 and #5 in issue #76, which are still outstanding.

Thanks,

Lindsey

joycepyang commented 6 years ago

Thanks @lzim for the continued guidance on how best to keep track of everything on GitHub; I didn't think about issue merges.

joycepyang commented 6 years ago

@saveth Sorry that wasn't very specific. If you look at the printed plots, you can see that on the y-axis, there's a section that's completely dark b/c each sta6a is being labeled on the y-axis so it's all overlapping. Is there a way to not print the labels?

saveth commented 6 years ago

@joycepyang Here are two ways to remove them, depending on the type of plot command used. If you're using base plot, use yaxt='n' in the plot command. For instance, plot(1:10, yaxt = 'n'). If you're using ggplot, like the ones used in your script, then use element_blank() in the theme command. For instance, ggplot(data, aes(x,y)) + geom_point() + theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())

joycepyang commented 6 years ago

Thanks @saveth for the suggestions about using element_blank! That definitely worked, and I updated all of the plots. I also used the par(mfrow = c(1,2)) code to put them on the same line; it worked for the histograms but not for the ggplot plots, and I'm not sure why.
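One likely explanation: par(mfrow) only affects base graphics, while ggplot2 draws with the grid system and ignores it. A minimal sketch of one alternative, assuming the gridExtra package is installed (the data frame and its columns below are made up for illustration):

```r
library(ggplot2)
library(gridExtra)

# Hypothetical clinic-level data, not the project data.
set.seed(1)
dat <- data.frame(
  sta6a    = factor(1:30),
  mean_cpt = rnorm(30, mean = 50, sd = 10),
  mean_cdw = rnorm(30, mean = 55, sd = 12)
)

# Hide the crowded clinic labels and ticks on the y-axis.
no_y_labels <- theme(axis.text.y = element_blank(),
                     axis.ticks.y = element_blank())

p1 <- ggplot(dat, aes(x = mean_cpt, y = sta6a)) +
  geom_point() + labs(x = "mean", y = "clinic", title = "CPT") + no_y_labels
p2 <- ggplot(dat, aes(x = mean_cdw, y = sta6a)) +
  geom_point() + labs(x = "mean", y = "clinic", title = "CDW") + no_y_labels

# Arrange the two ggplot objects in one row with two columns.
grid.arrange(p1, p2, ncol = 2)
```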

One other thing I also ran into is that for two of the variables, I was unable to get the histogram to run due to the number of breaks:

mean centered


```
#in CPT
tmh_mean_cpt$mean_xc <- tmh_mean_cpt$mean - mean(tmh_mean_cpt$mean)
#in CDW
tmh_mean_cdw$mean_xc <- tmh_mean_cdw$mean - mean(tmh_mean_cdw$mean)
par(mfrow=c(1, 2))
hist(tmh_mean_cpt$mean_xc, main = "TMH CPT", xlab = "Centered Mean", bins = 20)
hist(tmh_mean_cdw$mean_xc, main = "TMH CDW", xlab = "Centered Mean", bins = 20)
```

```
Error in hist.default(tmh_mean_cpt$mean_xc, main = "TMH CPT", xlab = "Centered Mean") : invalid number of 'breaks'
```

This also occurred in the mmencounter, groupencounter, and CPT initial appointments variables.
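One possible cause, not confirmed anywhere in this thread: hist() raises this error when no finite values are left to plot. That can happen if the column contains an NA, because mean() then returns NA and every centered value becomes NA. Separately, hist() takes `breaks` rather than `bins`. A minimal sketch with made-up data (`toy` is hypothetical, not the TMH data):

```r
# Made-up data with one missing value (hypothetical).
toy <- data.frame(mean = c(2.1, 3.4, NA, 5.0, 4.2, 3.8))

# Without na.rm, mean() returns NA, so every centered value is NA.
bad_xc <- toy$mean - mean(toy$mean)
msg <- tryCatch(
  {
    hist(bad_xc, plot = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping NAs when computing the mean keeps the finite values.
toy$mean_xc <- toy$mean - mean(toy$mean, na.rm = TRUE)
hist(toy$mean_xc, main = "Toy example", xlab = "Centered Mean", breaks = 20)
```

Checking sum(is.na(x)) and sum(is.finite(x)) on each affected variable would show whether this is what is happening in the real data.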

I'll attach the knitted file here so you can see that as well. 

Last point: after eliminating the values that appeared repeatedly without a clear reason (e.g., 2089 in tmh), some of the remaining variables had very few data points, especially CPT and PE initial appointments. It would be great to discuss this on our next call @lzim, as I'm not really sure what is happening.

saveth commented 6 years ago

@joycepyang I think the Skype session after this post helped address all the technical issues you had with the code. I guess what remains are your questions for @lzim.

lzim commented 6 years ago

From original issue #76

Next steps #4: Due to the longitudinal/observational focus of our primary analyses, we do need to understand whether these measures of central tendency in each data set are obscuring secular trends. Specifically, it is very likely that overall demand for services (as measured by encounters) and adoption of EBPsy (as measured by CPT and PE templates) are increasing.

Next steps #5: To explore and report on this, we would need graphs over time that show the measure of central tendency for each quarter observed in the dataset, with box and whiskers spread of the distribution for that quarter
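A minimal base R sketch of such a display, using simulated data (the column names `date` and `encounters` are assumptions for illustration, not the actual dataset fields):

```r
# Simulated encounter counts over two years (hypothetical data).
set.seed(42)
dates <- seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day")
dat <- data.frame(
  date       = sample(dates, 1000, replace = TRUE),
  encounters = rpois(1000, lambda = 20)
)

# Label each observation with its calendar quarter, e.g. "2016-Q1".
dat$quarter <- factor(paste0(format(dat$date, "%Y"), "-", quarters(dat$date)))

# One box and whiskers per quarter, in chronological order.
boxplot(encounters ~ quarter, data = dat,
        xlab = "Quarter", ylab = "Encounters", las = 2)

# Overlay the quarterly medians to make any secular trend easier to see.
med <- tapply(dat$encounters, dat$quarter, median)
lines(seq_along(med), med, type = "b", col = "red")
```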

lzim commented 6 years ago

@saveth

Thanks so much for your work on the Shiny app! 💯

I think that the thing I wonder about the most is what I requested at the end of the quant workgroup meeting on Thursday - the possibility of box and whiskers plots with time on the x-axis, and variable on y-axis, so that for each time observation (year or month), we could see a distribution of the selected variable.

I explored a bit today, I'm going to look a little more, but this is top of mind! Let me know if you have questions

Thanks!

Lindsey

lzim commented 6 years ago

@saveth

To be clearer: due to the high level of variability and the skew we have observed, it would be great to use a box and whisker plot with the median and the 25th/75th percentiles, with outliers (e.g., > 95th percentile or something) depicted as dots. Thanks!
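A rough sketch of one way to get that in base R, assuming the whiskers are moved to the 5th and 95th percentiles and everything outside them is drawn as dots. Both the cutoffs and the simulated data are illustrative assumptions:

```r
# Simulated skewed data for four quarters (hypothetical).
set.seed(1)
dat <- data.frame(
  quarter = rep(c("2017-Q1", "2017-Q2", "2017-Q3", "2017-Q4"), each = 200),
  value   = rexp(800, rate = 1 / 10)
)

# Compute the standard boxplot object without drawing it.
b <- boxplot(value ~ quarter, data = dat, plot = FALSE)

# Replace the five summary rows with the 5th, 25th, 50th, 75th, 95th percentiles.
groups  <- split(dat$value, dat$quarter)
probs   <- c(0.05, 0.25, 0.50, 0.75, 0.95)
b$stats <- sapply(groups, quantile, probs = probs, names = FALSE)

# Points outside the 5th-95th percentile range become the plotted outliers.
out_list <- lapply(groups, function(v) {
  v[v < quantile(v, 0.05) | v > quantile(v, 0.95)]
})
b$out   <- unlist(out_list, use.names = FALSE)
b$group <- rep(seq_along(out_list), lengths(out_list))

# Draw the modified boxplot.
bxp(b, xlab = "Quarter", ylab = "Value")
```

The box edges here come from quantile(), which can differ slightly from the hinges that boxplot() uses by default.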

lzim commented 6 years ago

@saveth Thinking about it even more, violin plots, and ridgeline plots, or even possibly letter-value plots are likely more useful for visualizing and understanding these data and key distributions.

I realized that the box plot I mentioned will likely look quite static as the statistical summaries stay the same, while the distributions are changing (prefer violin to box-plot for this). And, density plots on their own, are very difficult to see/interpret with the multiple nested observations we have (prefer ridgeline to box-plot for this). Finally, since this is a larger dataset we can pursue visualization that affords more precise information about our tails.

violin plots for understanding how distributions of data vary over time.

Violin plot with time of observation - year or month - on x-axis, and variable on the y-axis

median in the middle
box and whiskers shows interquartile range
shape of the violin displays frequencies of values

This will be really nice for our non-normally distributed data!
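A minimal ggplot2 sketch of that violin display, with a narrow box plot drawn inside each violin. The data are simulated and the column names `year` and `value` are placeholders:

```r
library(ggplot2)

# Simulated right-skewed data whose distribution shifts each year (hypothetical).
set.seed(1)
dat <- data.frame(
  year  = factor(rep(2014:2017, each = 150)),
  value = rlnorm(600, meanlog = rep(c(2.0, 2.2, 2.4, 2.6), each = 150), sdlog = 0.6)
)

p <- ggplot(dat, aes(x = year, y = value)) +
  geom_violin(fill = "grey85") +                    # shape shows frequency of values
  geom_boxplot(width = 0.1, outlier.shape = NA) +   # median and interquartile range
  labs(x = "Year of observation", y = "Variable")
print(p)
```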

A few other thoughts about stratifying the data to display these, however:

- Another way that might be helpful to see these would be with the variable on the x-axis and the clinics on the y-axis, grouped and ordered in a meaningful way. (Note: It may make just as much sense for the grouping variable to be on the x-axis and the variable on the y-axis. But since it is most common, when interpreting graphs of distributions, for the variable to be on the x-axis, I proposed the first idea.)

ridgeline plots for stratifying clinics based on their distributions

Groupings might be clinics whose distributions fall into deciles for the variable, ranging from the lowest decile to the highest decile on that variable up the y-axis. That way, we'd be able to see and tease apart how the distribution looks for clinics that fall into each %ile. I think we want to see 10 groups (i.e., deciles).
Key for the ridgeline plots: we want to get the distributions for particular clinics (sta6a), i.e., use the average distribution of that variable over time for a given clinic, with clinics stratified and displayed in their deciles.
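A minimal base R sketch of the decile grouping, assuming each clinic is ranked by its mean value over time (that ranking rule, the simulated data, and the column names other than sta6a are assumptions for illustration):

```r
# Simulated data: 100 clinics with 12 observations each (hypothetical).
set.seed(1)
dat <- data.frame(
  sta6a = rep(sprintf("clinic%03d", 1:100), each = 12),
  value = rexp(1200, rate = 1 / rep(runif(100, 5, 50), each = 12))
)

# Average each clinic's values over time.
clinic_means <- aggregate(value ~ sta6a, data = dat, FUN = mean)

# Cut the clinic means at the 10th, 20th, ..., 90th percentiles.
cuts <- quantile(clinic_means$value, probs = seq(0, 1, by = 0.1))
clinic_means$decile <- cut(clinic_means$value, breaks = cuts,
                           include.lowest = TRUE, labels = 1:10)

# Attach each clinic's decile back onto the observation-level data.
dat <- merge(dat, clinic_means[, c("sta6a", "decile")], by = "sta6a")
table(clinic_means$decile)
```

The resulting `decile` column can then serve as the grouping variable on the y-axis of a ridgeline plot or as a faceting variable.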

Faceting to display 10 stratified groups combined with other packages

I see that @clauswilke has a package, ggridges, that produces something awesome that looks like the Joy Division Unknown Pleasures album art.

Get @clauswilke's ggridges from CRAN

install.packages("ggridges")

Or latest development version from GitHub https://github.com/clauswilke/ggridges:

library(devtools)
install_github("clauswilke/ggridges")

Produce plots that show the distributions for the stratified deciles.

The clinics that fall into that decile each have a tick mark on the y-axis, and the variable distribution is on the x-axis. Note: I did not ensure this is operable code, just sort of sketching out what the code would likely be:
```
library(dplyr)
library(ggplot2)
library(ggridges)
# Order the decile groups by their median of the variable, then draw one ridge per group
dat %>% mutate(group = reorder(decile, variable, median)) %>%
  ggplot(aes(x = variable, y = group)) +
  geom_density_ridges(scale = 10)
```
Facet these to see 10 deciles in a panel side by side (only 5 shown here)

We may consider also whether there are bivariate relations that we would like to show this way (I will think more about which bivariate relationships). As shown with the diamonds dataset here:

letter-value plots

by @hadley https://github.com/hadley/lvplot
NOTE: I considered this approach too, but I'm not sure that generating and visualizing the letter value summaries for these plots are as readily interpretable for most viewers as compared to the violin plots and ridgeline plots above so, I ruled this out for now.
CITATION: Heike Hofmann, Hadley Wickham & Karen Kafadar (2017) Letter-Value Plots: Boxplots for Large Data, Journal of Computational and Graphical Statistics, 26:3, 469-477, DOI: 10.1080/10618600.2017.1305277

Available at https://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1305277?journalCode=ucgs20

# install.packages("devtools")
devtools::install_github("hadley/lvplot")

lzim / teampsd

Review plots code MERGE with Issue #76 #205
