ropensci / visdat

Preliminary Exploratory Visualisation of Data
https://docs.ropensci.org/visdat/

vis_compare expansion #109

Open Maschette opened 5 years ago

Maschette commented 5 years ago

Suggestion for a new function which would essentially be an extension of vis_compare but uses colors to show what the 'not same' change is. Admittedly this is a limited use case: with too many variables it would be super messy. The main purpose I was thinking of is tracking changes in genetics technical replicates, where the possible values are 0, 1, 2, NA and you want to keep track of what these values change to between replicates.

Halfway through writing this I thought I may as well have a crack. This works, although the 'new bit' could be better; I'm primarily a base kid, so that is what I did it in rather than mutate. Also, I didn't use your colors and went with viridis. Finally, the only thing I didn't do, out of laziness, is convert NAs used in from/to scenarios to "NA" so that ggplot doesn't remove them and they get a color.

You could also not implement this if you think it is weird and that would be super fine.

library(magrittr)  # provides %>%, which the function body uses un-namespaced

vis_compare_new <- function(df1,
                        df2, type="same"){

  # throw error if df1 not data.frame
  visdat:::test_if_dataframe(df1)

  # throw error if df2 not data.frame
  visdat:::test_if_dataframe(df2)

  if (!identical(dim(df1), dim(df2))) {
    stop("vis_compare requires identical dimensions of df1 and df2")
  }

    v_identical <- Vectorize(identical)
    df_diff <- purrr::map2_df(df1,df2, v_identical)
    head(df_diff)
    d <- df_diff %>% as.data.frame() %>% purrr::map_df(visdat:::compare_print) %>% 
        visdat:::vis_gather_() %>% dplyr::mutate(value_df1 = visdat:::vis_extract_value_(df1), 
        value_df2 = visdat:::vis_extract_value_(df2))
#The new bit
    if (type!="same"){
    cols<-c('value_df1','value_df2' )
    d$fctr <- apply( d[ , cols ] , 1 , paste , collapse = "-" )
    d$fctr[d$valueType=="same"]<-"same"
    d$value_df1<-as.character(d$value_df1)
    d$value_df2<-as.character(d$value_df2)
    d[,cols][d$valueType=="same",]<-"same"
    }
    fillType<-dplyr::case_when(
      type == "same"~"valueType", 
      type == "from"~"value_df1", 
      type == "to"~ "value_df2",
      type == "both"~"fctr")

ggplot2::ggplot(data = d, ggplot2::aes_string(x = "variable", y = "rows")) + 
  ggplot2::geom_raster(ggplot2::aes_string(fill = fillType)) + 
  ggplot2::theme_minimal() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 1, hjust = 1)) + 
  ggplot2::labs(x = "", y = "Observations", fill = "Cell Type") + 
 # ggplot2::scale_fill_manual(limits = c("same", "different"), breaks = c("same", "different"), values = c("#fc8d59", 
  #      "#91bfdb"), na.value = "grey") + 
  ggplot2::scale_y_reverse() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.25)) + 
  ggplot2::scale_x_discrete(position = "top", limits = names(df_diff))+viridis::scale_fill_viridis(discrete = TRUE)
}

vis_compare_new(df1, df2, type = "both")

njtierney commented 5 years ago

Heya @Maschette ! :)

Thanks for putting the time into this :)

Do you think you could provide an example of the kinds of data you were imagining being compared here? I think that the idea is worthwhile exploring!

Maschette commented 5 years ago

Hey @njtierney, no worries, it was surprisingly quick. I have been thinking on this and it may be worth making it a new function, vis_compare_dif maybe? The idea would then be to add a rm.same option for when you want to filter out the cells that are the same and just display the differences.
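The rm.same idea could be sketched roughly as below. This is only an illustration in base R: `drop_same()` and the toy data frame are hypothetical, and the sketch assumes the long-format data has a `valueType` column holding "same" or "different", as the `d` object in the function above does.

```r
# Hypothetical long-format data, shaped like the `d` built inside the function
d <- data.frame(
  rows      = rep(1:3, times = 2),
  variable  = rep(c("x", "y"), each = 3),
  valueType = c("same", "different", "same", "different", "different", "same"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the cells that did not change when rm.same = TRUE
drop_same <- function(d, rm.same = FALSE) {
  if (rm.same) {
    d <- d[d$valueType != "same", , drop = FALSE]
  }
  d
}

changed <- drop_same(d, rm.same = TRUE)
nrow(changed)
```

Filtering before plotting means the unchanged cells would simply be left blank in the raster, rather than being given their own "same" color.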

Anyway, use case: this is a subset of genetics data from a technical replicate.

x1<-data.frame(x = c(NA, 2, 1, 2, 2, NA, 2, 2, 2, 2, 2, 
NA, 2, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, 2, 2, 2, 2, 2, 0, 2, NA, 
0, 2, 2, 2, 2, 2, 2, 2, NA, 0, 2, NA, NA, 0, 2, 2, NA, 2, 2, 
NA, 1, NA, 2, NA, 2, NA, 2, 0, NA, 2, 2, 0, NA, 2, NA, 2, 2, 
NA, 2, 0, NA, 2, 2, 2, 2, NA, 2, 2, 2, NA, NA, NA, NA, 2, NA, 
2, NA, NA, 2, 2, NA, NA, NA, 2, NA, 2, 2, NA, NA), y = c(NA, 
2, 1, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 
NA, 2, 2, 2, 2, NA, 2, 0, 2, NA, 0, 2, 2, 2, 2, 2, 2, 2, 0, NA, 
2, NA, NA, 0, 2, 2, NA, 2, 2, 2, NA, NA, 2, 2, 2, NA, 2, 0, 2, 
2, 2, 0, 0, 2, NA, NA, 2, NA, 2, 0, NA, 2, 2, 2, 2, 2, 2, NA, 
2, 2, 2, 2, NA, 2, NA, 2, NA, NA, 2, 2, NA, 2, NA, 2, NA, 2, 
2, NA, NA))

vis_compare_new(x1[1], x1[2], type = "both")

image

The other option would be to display both columns of data and show side by side how they are different.

vis_compare_new(x1, rev(x1), type = "from")

image

njtierney commented 5 years ago

I like this a lot!

I would like to include this in visdat!

Two things to think about:

  1. The name - something to indicate that it is visualising the change/state/shift? would vis_shift make sense to you?

  2. The documentation about expected use - here it seems that comparison of columns is the focus, is that correct?

Thanks again for taking the time to do this, I really like it!

Maschette commented 5 years ago

Hi Nick, on your two points:

  1. I like vis_change, or maybe vis_diff?

  2. Yeah, column comparison is the main use case I was thinking of, but you should be able to use it for other things.

njtierney commented 5 years ago

Hi Dale,

  1. How about plural: vis_changes() or do you think singular vis_change() makes more sense for you? vis_diff() makes me think of git diff, but maybe that evokes more of what you think this would be used for?

  2. Sounds good!

  3. I'm not sure about the option type = "both"/"from" - perhaps something more verbose like show_both?

  4. Would this work for two data.frames?

Maschette commented 5 years ago

  1. vis_change() sounds good.

  3. There is also a type = "to", so if you know all your data should be, for example, 0, you can see what it changes to. So maybe show = "both" as the default?

  4. Yes, it does.

This is where I got to. Since it would be a new function, I removed the "same" option from case_when(). One thing that would be cool: when comparing data frames with different column names, work out how to show the names of both columns on the x-axis.

vis_change <- function(df1,
                        df2, show="both"){

  # throw error if df1 not data.frame
  visdat:::test_if_dataframe(df1)

  # throw error if df2 not data.frame
  visdat:::test_if_dataframe(df2)

  if (!identical(dim(df1), dim(df2))) {
    stop("vis_change requires identical dimensions of df1 and df2")
  }

    v_identical <- Vectorize(identical)
    df_diff <- purrr::map2_df(df1,df2, v_identical)
    head(df_diff)
    d <- df_diff %>% as.data.frame() %>% purrr::map_df(visdat:::compare_print) %>% 
        visdat:::vis_gather_() %>% dplyr::mutate(value_df1 = visdat:::vis_extract_value_(df1), 
        value_df2 = visdat:::vis_extract_value_(df2))
#The new bit
    if (show!="same"){  # was `type`, which is no longer an argument of this function
    cols<-c('value_df1','value_df2' )
    d$fctr <- apply( d[ , cols ] , 1 , paste , collapse = "-" )
    d$fctr[d$valueType=="same"]<-"same"
    d$value_df1<-as.character(d$value_df1)
    d$value_df2<-as.character(d$value_df2)
    d$value_df1[is.na(d$value_df1)]<-"NA"
    d$value_df2[is.na(d$value_df2)]<-"NA"

    d[,cols][d$valueType=="same",]<-"same"
    }
    fillType<-dplyr::case_when(
      show== "from"~"value_df1", 
      show== "to"~ "value_df2",
      show== "both"~"fctr")

ggplot2::ggplot(data = d, ggplot2::aes_string(x = "variable", y = "rows")) + 
  ggplot2::geom_raster(ggplot2::aes_string(fill = fillType)) + 
  ggplot2::theme_minimal() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 1, hjust = 1)) + 
  ggplot2::labs(x = "", y = "Observations", fill = "Cell Type") + 
 # ggplot2::scale_fill_manual(limits = c("same", "different"), breaks = c("same", "different"), values = c("#fc8d59", 
  #      "#91bfdb"), na.value = "grey") + 
  ggplot2::scale_y_reverse() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.25)) + 
  ggplot2::scale_x_discrete(position = "top", limits = names(df_diff))+viridis::scale_fill_viridis(discrete = TRUE)
}
x1<-data.frame(x = c(NA, 2, 1, 2, 2, NA, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, 2, 2, 2, 2, 2, 0, 2, NA,  2,NA, 2, 0, NA, 2, 2, 2, 2, NA, 2, 2, 2, NA, NA, NA, NA, 2, NA, 2, NA, NA, 2, 2, NA, NA, NA, 2, NA, 2, 2, NA), 
               y = c(NA, 2, 1, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, 2, 2, 2, NA, 2, 0, 2, NA, NA, 2, 0, NA, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, NA, 2, NA, 2, NA, NA, 2, 2, NA, 2, NA, 2, NA, 2, 2, NA, NA),  
               z = c( 0, 2, NA, 0, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, NA, NA, 0, 2, 2, NA, 2, 2, 2, NA, NA, 2, 2, 2, NA, 2,  2, 2, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, 2, 2, 2, NA, 2, 2, NA, 2, NA, NA, 2, 2, NA, 2, NA, 2, NA, 2, 2, NA, NA))

y1<-data.frame(w=c(2, 2, 0, NA, 2, NA, 2, 2, NA, 2, 0, NA, 2, 2, 2, 2, NA, 2, 2, 2, NA, NA, NA, NA, 2, NA, 2, NA, NA,  2, 2, 0, 2, NA, 0, 2, 2, 2, 2, 2, 2, 2, NA, 0, 2, NA, NA, 0, 2, 2, NA, 2, 2, NA, 1, NA, 2, NA, 2, NA, 2, 0, NA), q = c(NA,2, 1, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, NA, 2, 2, 2, 2, NA, 2, 0, 2, NA,  NA, 2, 0, NA, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, NA, 2, NA, 2, NA, NA, 2, 2, NA, 2, NA, 2, NA, 2, 2, NA, NA), d = c( NA, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, NA,NA, 2, 1, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2,  0, NA,2, NA, NA, 0, 2, 2, NA, 2, 2, 2, NA, NA, 2, 2, 2, NA, 2, 0, 2, 2, 2, 0, 0, 2, NA, NA, 2, NA, 2, 0))

vis_change(x1, y1, show= "to")
vis_change(x1, y1, show= "both")

image image
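The idea of showing both column names on the x-axis when the two data frames are named differently could be sketched like this. `paired_names()` is a hypothetical helper, not part of visdat, and the " / " separator is an arbitrary choice.

```r
# Hypothetical helper: combine the column names of two data frames into
# one axis label per column position, e.g. "x / w"
paired_names <- function(df1, df2, sep = " / ") {
  stopifnot(ncol(df1) == ncol(df2))
  n1 <- names(df1)
  n2 <- names(df2)
  # Only show both names where they actually differ
  ifelse(n1 == n2, n1, paste(n1, n2, sep = sep))
}

# Toy data frames with partly different column names
x1 <- data.frame(x = 1:2, y = 3:4, z = 5:6)
y1 <- data.frame(w = 1:2, y = 3:4, d = 5:6)

axis_labels <- paired_names(x1, y1)
axis_labels
```

One possible way to use this would be to pass the result as the `labels` argument of `ggplot2::scale_x_discrete()` while keeping `limits = names(df_diff)`, so the plotted positions stay tied to the df1 column names.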

Maschette commented 5 years ago

Oh, it just occurred to me: with the 'same' option removed from type, you could also remove the if statement, as that block will always run.
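With the if statement gone, the labelling step reduces to something like the following base R sketch. `change_labels()` is a hypothetical stand-alone helper that works on two plain vectors rather than on the gathered data frame; it treats NA as an ordinary value, so NA to NA counts as "same", in line with what `identical()` does in the function above.

```r
# Hypothetical helper: label each cell "from-to", or "same" when unchanged
change_labels <- function(from, to) {
  # Turn NA into the string "NA" so it is kept and can be given a color
  from_chr <- ifelse(is.na(from), "NA", as.character(from))
  to_chr   <- ifelse(is.na(to),   "NA", as.character(to))
  ifelse(from_chr == to_chr, "same", paste(from_chr, to_chr, sep = "-"))
}

labels <- change_labels(c(NA, 2, 1, 0), c(NA, 2, NA, 2))
labels
# "same" "same" "1-NA" "0-2"
```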