sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

weights for tabyl #183

Open tklebel opened 6 years ago

tklebel commented 6 years ago

Feature requests: weights

It would be great if one could specify weights when using tabyl. Weights are very common in survey data, and without them tabyl is of little use for such data.

If the implementation is more or less straightforward and within the scope of your package, I would be happy to assist with a pull request. I looked at the source for tabyl and it seems to me that passing an argument down to dplyr::count for wt, like the following, should be enough?

library(dplyr)

test_df <- tribble(~x, ~wt,
                   "a", 1,
                   "a", 1,
                   "b", .5,
                   "b", .5)

test_df %>% 
  count(x)
#> # A tibble: 2 x 2
#>   x         n
#>   <chr> <int>
#> 1 a         2
#> 2 b         2

test_df %>% 
  count(x, wt = wt)
#> # A tibble: 2 x 2
#>   x         n
#>   <chr> <dbl>
#> 1 a      2.00
#> 2 b      1.00

Or is there something I am overlooking?

sfirke commented 6 years ago

Thanks for this clearly stated feature request! I have been thinking about it and feel unsure. I see questions of fit and implementation.

Fit: Someone else has mentioned wanting this before and it seems like a useful feature to some users. Janitor's boundaries aren't very crisp, but while tabyl could be seen as a data cleaning tool (e.g., exploring a variable), this starts to be more of a purely data analysis feature. I don't know of a tidy tools package for survey data specifically, which might be the perfect home for this.

Implementation: You're right about modifying dplyr::count to use the wt argument. This change would be simple in janitor 0.3.1 but more complex in the new version, because tabyl now includes the former function crosstab and a 3-way version as well. Do/would you use weighting in those contexts, that is, counts of 2-3 variables with weighting by another one? So say with dplyr, count(mtcars, cyl, am, wt = mpg). I ask genuinely and would love to hear from anyone on this, as I don't use weighting like this in my own analyses.

If it makes sense and would be used in the 2- and 3- variable contexts, then I think it's implementable. It's yet another argument to tabyl but that's an acceptable trade-off if enough people would use it.

Would love to hear from other users and janitor contributors/stakeholders!

tklebel commented 6 years ago

I have to admit, I am not an expert when it comes to survey weights, but from what I gather, the point is as follows:

In surveys you might have at least two kinds of weights: design weights to counter over-sampling of different sub-populations, and post-stratification weights to counter nonresponse, among other things. Especially when counting and cross-tabulating, those weights need to be incorporated, because percentages would otherwise not be representative of the population.

For weighting this would mean that you sum the weights per group, as dplyr::count does, instead of simply counting categories. So to answer your question: yes, for counts of 2-3 variables you would want to weight by another variable. But it seems to be less straightforward than I first thought. The total n of a crosstab, for example, should not be the sum of the weights but still the unweighted n, since the sample size stays the same whether you weight the cases or not. This, however, would probably be difficult to implement in janitor, since the adorn_totals function would need to be changed as well.
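
A minimal sketch of the distinction described above, assuming a made-up data frame named survey with a weight column wt (both names are hypothetical). It keeps the unweighted cell sizes next to the summed weights, so percentages come from the weights while the sample size stays visible:

```r
library(dplyr)

# Hypothetical toy survey: respondents in group "b" carry half the weight of group "a"
survey <- tibble(
  x  = c("a", "a", "b", "b", "b", "b"),
  y  = c("yes", "no", "yes", "yes", "no", "no"),
  wt = c(1, 1, 0.5, 0.5, 0.5, 0.5)
)

cells <- survey %>%
  group_by(x, y) %>%
  summarise(
    n_unweighted = n(),      # number of sampled cases in the cell
    n_weighted   = sum(wt),  # same value count(x, y, wt = wt) would give
    .groups = "drop"
  ) %>%
  mutate(pct_weighted = n_weighted / sum(n_weighted))

cells

sum(cells$n_unweighted)  # sample size: 6
sum(cells$n_weighted)    # sum of the weights: 4
```

Here the two totals differ (6 cases, weights summing to 4), which is exactly the situation where a totals row would have to choose which one to report.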

All in all, it seems to me that a separate package for tidy survey analysis, or at least for survey crosstabs, would be a better fit than your package. It just feels like a pity to start exploring variables within one package (janitor), with a syntax that works great, and then need to move to something else for correct counts. From the user's perspective, weighting should be a simple task (in SPSS you simply "throw a switch" and "it works", although as usual it is tough to know what SPSS really does). But the implementation is probably not as straightforward.

sfirke commented 6 years ago

I was just looking at the questionr package, for survey analysis, and it has a function similar to tabyl but supporting weighting. Perhaps the answer here is to tackle the survey analysis with questionr which is specifically survey-oriented?

ghost commented 6 years ago

The problem with the questionr package is that it's not tidyverse friendly, so it doesn't really answer the need for a version of tabyl() that allows weighting. I've started using the janitor package, and find it very useful, but for my final analysis, which requires weights, I'm unfortunately back to having to create my own ad hoc functions.

jackobailey commented 5 years ago

I'd second that weighting would be useful. You've already taken a step into analysis in implementing the chisq.test() and fisher.test() functions. Why not allow for weighting too?

sfirke commented 5 years ago

Is srvyr an acceptable, tidyverse alternative? https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html

ghost commented 5 years ago

I've lately been using the wt argument in dplyr::count() to set up my own version of tabyl(). I've found it more flexible than using survey/srvyr.

markhwhiteii commented 5 years ago

for what it's worth, here's something I wrote that uses count() and then some adorn_ functions later, to get around the weighting issue:

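library(dplyr)  # %>%, count(), mutate_at(), group_by(), ...
library(tidyr)  # spread()
library(rlang)  # sym()

xtab_1v <- function(data, dv, iv, weight = NULL) {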
  data %>% 
    mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>% 
    {
      if (is.null(weight)) count(., !!sym(dv), !!sym(iv)) 
      else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
    } %>% 
    group_by(!!sym(dv)) %>% 
    mutate(pct = n / sum(n)) %>% 
    ungroup() %>% 
    select(-n) %>% 
    spread(!!sym(iv), pct) %>% 
    janitor::adorn_pct_formatting() %>% 
    janitor::adorn_ns(ns = {
      data %>% 
        mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>% 
        {
          if (is.null(weight)) count(., !!sym(dv), !!sym(iv)) 
          else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
        } %>% 
        mutate(n = round(n, 1)) %>% 
        spread(!!sym(iv), n, fill = 0)
    }) %>% 
    mutate(variable = dv) %>% 
    rename("value" = dv) %>% 
    `[`(TRUE, c(length(.), 1:(length(.) - 1)))
}

We can try with:

set.seed(1839)
data <- data.frame(
  x = sample(letters[1:2], 200, TRUE), 
  y = sample(letters[3:4], 200, TRUE), 
  weight = runif(200)
)
xtab_1v(data, "x", "y", "weight")

Which returns:

# A tibble: 2 x 4
  variable value c            d           
  <chr>    <fct> <chr>        <chr>       
1 x        a     48.7% (27.3) 51.3% (28.8)
2 x        b     48.8% (22.1) 51.2% (23.2)

As always, be very careful when using weights in R. Make sure you know what you're doing, as there are many different types of weights out there. This function should be fine, since the real issues arise when calculating standard errors.

jcthrawn commented 4 years ago

I agree. This is a good function and a good package, tidyverse compatible, and the adorn functions are perfect. But I work with data from the French public statistics system, where everything is weighted, so I really cannot use it! Weights for tabyl would be great!

markhwhiteii commented 4 years ago

Perhaps the srvyr package could help with this?

library(tidyverse)
library(srvyr)

set.seed(1839)
dat <- tibble(
  x = factor(rbinom(200, 1, .5)), 
  y = factor(rbinom(200, 1, .5)), 
  w = runif(200)
)

dat %>% 
  as_survey() %>%  # note: w is not passed as a weight here, so the estimates below are unweighted
  group_by(x, y) %>% 
  summarise(pct = survey_mean())
# A tibble: 4 x 4
  x     y       pct pct_se
  <fct> <fct> <dbl>  <dbl>
1 0     0     0.491 0.0478
2 0     1     0.509 0.0478
3 1     0     0.5   0.0528
4 1     1     0.5   0.0528

olanderb commented 4 years ago

Hi all, I would like to second the requests for adding a weights option for tabyl. I'm a novice user of R trying to convince my colleagues to pick up R for its many advantages, including reproducibility.
However, something simple like making a 2 x 2 table with percentages and weights (dplyr compatible) isn't quite so straightforward in R for a rookie.

Thanks for considering.

sfirke commented 4 years ago

I am open to adding this feature, if people are confident it's not already implemented in another package (e.g., @markhwhiteii gives the example from srvyr last week) or if it would be worthwhile to add it here even if it exists elsewhere. In short, it would add an argument "wt" to tabyl() and weight the 1-, 2-, or 3-way tabyls accordingly. (yes?)

I don't have the time to implement this feature, though, so someone would need to own that. Design what users want & what will be able to be implemented, then create the code and tests on a fork and submit a pull request. I can advise and give some feedback, especially where it relates to the internal code of tabyl (which is approachable, I don't use anything ultra advanced, but it makes more sense after you get oriented to it, especially if you're not familiar with S3 methods).

I also don't know much about weighting in analysis so would need that perspective represented as well.

jcthrawn commented 4 years ago

Hi everybody,

I may be wrong because I haven’t tested it for long, but survey/srvyr doesn’t seem to offer much more than dplyr for weighted tables. We can simply do:

tab <- dat %>% group_by(var1, var2) %>% summarize(n=sum(wt)) %>% spread(var2, n)

It would be OK for me to write my own functions and use janitor's adorn_ functions, which I really love, on the result. But I also think it is better when things are straightforward and an absolute beginner can easily read the code, especially for something as basic as a crosstab. To teach statistics to social science students who know nothing about programming, I currently use SAS and would love to switch to R, because it is free and works better; but I hesitate when I see how much code is needed to produce, format and export a simple table. You have created an efficient and easy function for making formatted crosstabs in a tidy way, and it would be a pity if half the world could not use it because they need weights! Weighted data cannot simply be piped into tabyl: the weighting has to happen when the frequencies are counted. A simple "wt" argument for all the 1-, 2- and 3-way tabyls would effectively be enough. Since a weight variable is simply numeric, generally has no NAs, and the multiplication only has to be done once, it should not create many problems. Something a bit more complicated would be an option for normalized weights, which ensures that the total number of individuals does not change, that is, that the sum of the weights equals n; but I think many of us do not really need this, at least with representative samples weighted to obtain the overall population number.
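
A short sketch of the normalized-weights idea mentioned above, assuming a toy data frame dat with columns var1, var2 and wt (the values and the helper column wt_norm are made up for illustration). Each weight is rescaled by n / sum(wt), so the rescaled weights sum to the number of rows:

```r
library(dplyr)

# Hypothetical data: weights that expand the sample to a population total
dat <- tibble(
  var1 = c("a", "a", "b", "b"),
  var2 = c("c", "d", "c", "d"),
  wt   = c(100, 300, 200, 400)
)

# Rescale so that the weights sum to the number of rows
dat_norm <- dat %>%
  mutate(wt_norm = wt * n() / sum(wt))

sum(dat_norm$wt_norm)  # equals nrow(dat), i.e. 4

# Weighted cell counts on the normalized scale
norm_counts <- dat_norm %>%
  count(var1, var2, wt = wt_norm)

norm_counts
```

The cell proportions are the same with either set of weights; only the scale of the counts changes.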

Unfortunately I don’t think I can implement the feature myself, because I have only been programming in R for two weeks, I still often produce quite "ugly" code and I don’t know what S3 methods are! But I can test it with different data and give feedback.

sfirke commented 3 years ago

I want to check my understanding here. Would the wt column be a numeric vector present in the input data.frame, and at the stage of making the counts (of either 1 variable or combinations of 2 variables), we would multiply the counts by the wt variable? Then everything else would proceed like it already does - the adorn_ functions, etc?

BriceNocenti commented 3 years ago

For a minimal weight argument, that's it: the counts just have to be multiplied once, and all the other functions can stay the same, provided the user knows that chi-squared tests and confidence intervals cannot be computed from a weighted table. Storing both weighted and unweighted results in attributes, and modifying functions to access the base counts, won’t be needed because that's not the point of janitor.
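
As a rough illustration of that workflow with the tools that exist today, the sketch below builds the weighted counts with dplyr::count(), reshapes them to a wide table and hands that to janitor::as_tabyl() so the adorn_ functions can be used. The data frame and its column names are made up, and this is only one possible workaround, not how a wt argument in tabyl() would necessarily be implemented:

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical data with a numeric weight column
dat <- tibble(
  x  = c("a", "a", "a", "b", "b"),
  y  = c("c", "d", "d", "c", "d"),
  wt = c(2, 1, 1, 3, 1)
)

weighted_tab <- dat %>%
  count(x, y, wt = wt) %>%  # sum of wt per x/y cell
  pivot_wider(names_from = y, values_from = n, values_fill = 0) %>%
  as.data.frame() %>%
  as_tabyl()                # treat the wide table as a two-way tabyl

# The usual adorn_ pipeline then runs on the weighted counts
weighted_tab %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns()
```

Note that the totals and Ns reported this way are sums of weights, not unweighted sample sizes.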

gdutz commented 3 years ago

The easiest solution I came up with was just to use the wt argument of the dplyr::count() function (as mentioned above by @tklebel and others). I had to duplicate parts of the code for calling count() with and without the wt argument (checking with missing()). Duplicating the code feels a bit inelegant, but at least it seems to work: I cross-checked the results with the table command and fweight in Stata and the results are identical, even for the 3-way tables.
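
One possible way to avoid the duplicated count() calls, sketched here under the assumption that forwarding the argument with the embrace operator is acceptable: because wt defaults to NULL in dplyr::count(), a wrapper can pass its own wt argument through unchanged and get plain row counts when it is not supplied. The helper name count_maybe_weighted is hypothetical and not part of janitor or the linked commit:

```r
library(dplyr)

# Hypothetical helper: a single code path for weighted and unweighted counts
count_maybe_weighted <- function(data, ..., wt = NULL) {
  # If wt is left at its NULL default, count() falls back to counting rows
  count(data, ..., wt = {{ wt }})
}

test_df <- tribble(~x, ~wt,
                   "a", 1,
                   "a", 1,
                   "b", .5,
                   "b", .5)

unweighted <- count_maybe_weighted(test_df, x)           # n: 2, 2
weighted   <- count_maybe_weighted(test_df, x, wt = wt)  # n: 2, 1
```

When a weight is supplied, count() sums it with missing values removed, so rows with an NA weight contribute nothing to the cell; that may be worth checking explicitly.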

I am sure my solution needs some testing and can be improved (checking for NAs, for example), but I think that's pretty straightforward. Implementing normalized weights (as suggested by @jcthrawn) should be pretty easy as well. And since I basically just added another call to count(), I hopefully did not break anything else...

I am a bit hesitant to just open a pull request since I did only a very limited amount of testing. You could perhaps look at the commit in my fork and decide if there should be any changes beforehand: gdutz/janitor@61540bf44fcf506e1d12a36c525e7b38ffc6409a

Sopwith commented 2 years ago

Just wanted to chime in that I would love if this package could replace my haphazard personal functions I use to generate weighted tables and crosstabs. When I work with data, it's almost always weighted. (I get the feeling that those who use weights, use them a lot.) I am able to generate my weights, but having them come through in a one-line call is important. I would like to jump on the bandwagon for this to be implemented, if there is a wagon to get on.

benzipperer commented 1 year ago

I'd like to add another request that a weight option be added to tabyl. We wrote our own function to do similar crosstabs with weights but would love to just recommend that folks use the otherwise excellent tabyl.

The example by @gdutz in https://github.com/sfirke/janitor/issues/183#issuecomment-909129910 seems to work well.

sirojasv commented 1 year ago

I support the request to add support for weights. I am a university statistics professor and with this feature I could fully recommend the use of tabyl.

jschmittwdc commented 10 months ago

Thank you! Great package. Only wish it had weights.

SimonEdscer commented 8 months ago

I would like to reiterate the request to add a weight feature to this package :) It would avoid a lot of duplicated code using counts. Thank you

dakotainstitute commented 7 months ago

Just thought I would add my support for this as well. Tabyl is one of the best options I've found for making tabs and crosstabs. It would be much more useful if we could use weights.