sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/
Other
1.39k stars 133 forks source link

missing column names after crosstab? #77

Closed randomgambit closed 6 years ago

randomgambit commented 7 years ago

Hello there,

First of all, congrats for this package. Coming from stata and pandas, this is EXACLY what dplyr needed. A robust, easy to use package that excels at computing tabulations and cross-tabulations!

I have a quick question. Consider this simple example


var1<- c(5,10,2,2,NA)
var2<- c(NA,NA,NA,NA,NA)
var3<- c(5,10,2,2,1)
var4<- c('A','B','C','C','D')
var5<- c('G','B','C','F','D')
df = df=data.frame(var1,var2,var3,var4,var5)

> df
  var1 var2 var3 var4 var5
1    5   NA    5    A    G
2   10   NA   10    B    B
3    2   NA    2    C    C
4    2   NA    2    C    F
5   NA   NA    1    D    D

Now I compute a simple cross-tabulation

> df %>% crosstab(var1, var3,  percent = "row")
  var1 1 2 5 10
1    2 0 1 0  0
2    5 0 0 1  0
3   10 0 0 0  1
4   NA 1 0 0  0

I dont get why I only see the variable name var1 above and not var1 and var3. Given the table above, how could you possibly know that this is a cross-tabulation between var1 and var3? Am I missing something?

Thanks!

randomgambit commented 7 years ago

by looking at the documentation, it seems this is the expected behavior of crosstab.

In my opinion, there is room for improvement here. In essence, by not showing the name of the second variable, we are losing information and our two-way frequency tables are not self-contained anymore. For a package specifically designed at cross-tabulations, I think this is a pretty serious issue.

What do you think? In any case, your package is really great and I hope my comments will help you improve it!

Example output in Stata where all the information is kept.

. tab make foreign if  _n <5

                   |  Car type
    Make and Model |  Domestic |     Total
-------------------+-----------+----------
       AMC Concord |         1 |         1 
         AMC Pacer |         1 |         1 
        AMC Spirit |         1 |         1 
     Buick Century |         1 |         1 
-------------------+-----------+----------
             Total |         4 |         4 
sfirke commented 7 years ago

I agree that the ideal result would contain information about both variables used in the crosstab and have actually put considerable thought into this. I'm open to other ideas, but I haven't found a good way to do that without compromising the data.frame structure of the result.

For interactive use, something like that Stata result is nice, to have the full info. But then the result can't be written to .csv, passed to adorn_crosstab(), joined to another data.frame, etc.

gmodels::CrossTable does something very similar to that Stata output: http://stackoverflow.com/questions/13043817/mimic-tabulate-command-from-stata-in-r Perhaps that is your best bet. Or from this SO answer about xtabs:

t1 <- rep(c("A","B","C"),5)
t2 <- rpois(15,4)
df <- data.frame(ques=t1,resp=t2)
z <- xtabs(~ques+resp,data=df)

> z
    resp
ques 1 3 4 5 6 8 9
   A 0 3 0 1 0 0 1
   B 1 1 2 0 0 1 0
   C 0 0 2 1 1 1 0

But I don't like that z is of class "table". And the var2 name is lost when printing with knitr::kable or writing as a .csv.

write.csv(z, "dummy.csv")
knitr::kable(z)

and as.data.frame(z) gives an undesirable long result.

I don't see where the var2 name could go in a data.frame. I tried printing it for interactive use like this:

crosstab2 <- function(...){
  dots <- as.list(substitute(list(...)))[-1L]
  print(paste0("          ", deparse(dots[[2]])))
  crosstab(...)
}
> crosstab2(mtcars$cyl, mtcars$carb)
[1] "          mtcars$carb"
  mtcars$cyl 1 2 3 4 6 8
1          4 5 6 0 0 0 0
2          6 2 0 0 4 1 0
3          8 0 4 3 6 0 1

But in interactive use, you already have your command right above it the console output, so it's not as much of an issue - and then if the printed statement doesn't stay with the resulting data.frame, it doesn't solve the underlying problem.

Here's an Excel screenshot from trying to conceptualize this:

image

I don't think the var2 name can be included without creating a blank column.

randomgambit commented 7 years ago

thanks for your detailed and nice answers. One possibility: what about concatenating in some form or another the two column names, so that the output is

> df %>% crosstab(var1, var3,  percent = "row")
  var1/var3 1 2 5 10
      1    2 0 1 0  0
      2    5 0 0 1  0
      3   10 0 0 0  1
      4   NA 1 0 0  0
sfirke commented 7 years ago

I see the upside of this, but am stuck on two problems:

  1. var1/var3 doesn't describe the contents of column 1; that's var1. This breaks an R convention and is confusing if say you want to %>% filter(var1 == 3).
  2. It's an illegal variable name with that special character, so subsequent function calls would have to handle that.

My solutions are (a) for interactive use, you have the syntax immediately before the result so can look there; (b) for scripted use you can add a title in Markdown or tweak the first column name with kable().

randomgambit commented 7 years ago

@sfirke !!! I wrote this in October!!!!! :D but, seriously, why dont you just add an option that says rowcol = True that, instead of var1/var3 writes the first 3 character of each variable separated by a - ? say, if the variables are samuel and firke it would show sam-fir?

sfirke commented 7 years ago

I'm hoping to get faster at responding :) Cleaning up some issues now before a minor CRAN submission.

That approach sounds like a good compromise. I'm going to redo the whole crosstab() function later this year, and I'll look at that as a possible option for crosstab() to take.

sfirke commented 7 years ago

A year later and I'm struck by an idea.

To recap, this looks pretty nice with "cyl" on top:

> library(magrittr)
> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

But it's not a data.frame.

This is a data.frame:

result <- mtcars %>%
  tabyl(am, cyl)

top <- result[1, ]
top[1, ] <- names(top)

bind_rows(
  top, result %>%
    mutate_if(is.numeric, as.character)
) %>%
  setNames(c("", "cyl", "", ""))

     cyl     
1 am   4 6  8
2  0   3 4 12
3  1   8 3  2

It's no longer any good for manipulating, since the column names are put in the 1st row of the data.frame - feels so wrong. BUT, it looks good for printing in a report. I think I could implement this as an adorn_, like adorn_col_title. I'd need to store the variable names as tabyl attributes, not hard, then it could do this.

I think it's at least worth a shot. Anyone want to implement this? The attribute stuff is not hard if you look at as_tabyl for guidance.

I wonder if it would throw off any bolding of column names that might occur in RMarkdown or another output format like a .docx template; would it bold only that single name in the row by itself?

sfirke commented 7 years ago

Note to self: allow it to take optional text argument to put on top. Default is to get it from the tabyl attributes, but allow for override. Use cases are overriding on a tabyl, or using with a non-tabyl, like some other tidyverse calculation ending with spread.

randomgambit commented 7 years ago

Hi @sfirke sorry for my late andwer. Yes, I think this is a very good idea, and actually the only possible one when you think about it but why not sticking to

> mtcars %$% table(am, cyl)
   cyl
am   4  6  8
  0  3  4 12
  1  8  3  2

at the end of the day? for printing, nobody cares whether this is a dataframe or a duck

sfirke commented 7 years ago

I agree that table is satisfactory in your example. But what if I want to print something like this:

> mtcars %>%
+   tabyl(am, cyl) %>%
+   adorn_percentages() %>%
+   adorn_pct_formatting()
  am     4     6     8
1  0 15.8% 21.1% 63.2%
2  1 61.5% 23.1% 15.4%

table can't do the % stuff, but I still might want that cyl at the top.

I'm tinkering with this and it should be doable.

sfirke commented 6 years ago

Might consider kableExtra::add_header_above. That could be inspiration, though it seems targeted at more heavy-weight LaTeX and HTML tables while this is still a simple data.frame.

sfirke commented 6 years ago

This feature is now live on the master branch! Try out adorn_title(). I did some data analysis today and found myself reflexively using it, which is a good sign. It's a weird function so would love feedback from anyone.

randomgambit commented 6 years ago

thanks that is great! sorry for getting back to you so late

vishalugle commented 6 years ago

Hi,

Thanks for adding this feature, I am using version 0.3.1 of janitor package and don't see adorn_title() in it. How can I get the latest version of the package ?

Thanks

sfirke commented 6 years ago

@vishalugle I plan to submit version 1.0 to CRAN tomorrow at which point this will be a moot question 😄 But in the meantime and for installing other package development versions from GitHub, try:

install.package(devtools)
devtools::install_github("sfirke/janitor")

See if that gives you that function?

vishalugle commented 6 years ago

Thanks for your quick response , I got the latest version using devtools.

On Thu, Mar 8, 2018 at 12:19 PM, Sam Firke notifications@github.com wrote:

@vishalugle https://github.com/vishalugle I plan to submit version 1.0 to CRAN tomorrow at which point this will be a moot question 😄 But in the meantime and for installing other package development versions from GitHub, try:

install.package(devtools) devtools::install_github("sfirke/janitor")

See if that gives you that function?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sfirke/janitor/issues/77#issuecomment-371557692, or mute the thread https://github.com/notifications/unsubscribe-auth/AOZTR7i5yf7Qf7nUKNmL-3fY9aMgWeBsks5tcWgQgaJpZM4KhajE .