spgarbet / tangram

Table Grammar package for R

chi square for 1 ~ var1 + var2 #49

Open kkmann opened 5 years ago

kkmann commented 5 years ago

Hey,

Absolutely love the package, can't wait for the JSS article! One quick question: when I do not have strata in the table, e.g. using the formula 0 ~ var1 + var2, what exactly are the null hypotheses tested by hmisc? And how do I suppress testing globally?

spgarbet commented 5 years ago

Each variable is treated separately. The tests would be 0 ~ var1 and 0 ~ var2 and the results pasted together. The package doesn't have a definition for using a 0 like that (although it could if you'd like to help me define expected behaviour).

It does support a '1'.

> library(tangram)
> tangram(1 ~ stage::Categorical, pbc)
======================================================
                                    N        All      
                                           (N=418)    
------------------------------------------------------
Histologic Stage, Ludwig Criteria  412                
   1                                    0.051   21/412
   2                                    0.223   92/412
   3                                    0.376  155/412
   4                                    0.350  144/412
======================================================

First, this assumes the default table transform, hmisc, which interprets the formula against the data to create the table. The hmisc transform does a χ² test under the null hypothesis that the categories are equally likely. The p-value and test are off by default on the current edge version on git. To turn them on, one can specify test=TRUE, which I just tried, and it's broken for the intercept case. So I'm turning this into a bug ticket. These additional options are just passed to the transform function, so each transform can have different options.
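The equal-probability null described above can be checked by hand in base R. This is only a minimal sketch, not the hmisc transform's internal code: it assumes the test is equivalent to a standard goodness-of-fit test, takes the observed counts from the table above, and uses `stage_counts` and `gof` as names chosen purely for illustration.

```r
# Observed counts per histologic stage, copied from the table above
# (21 + 92 + 155 + 144 = 412 non-missing subjects).
stage_counts <- c(`1` = 21, `2` = 92, `3` = 155, `4` = 144)

# Given a single vector and no `p` argument, chisq.test() runs a
# goodness-of-fit test against equal probabilities (here 1/4 per stage).
gof <- chisq.test(stage_counts)

gof$expected   # expected count per stage under the null
gof$statistic  # sum((observed - expected)^2 / expected)
gof$parameter  # degrees of freedom: number of categories - 1
gof$p.value
```

A different null, such as proportions taken from a reference population, could be supplied through the `p` argument of `chisq.test()`.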

Now, one can always write their own transform if something else is desired. So the question becomes: when you typed the formula 0 ~ var1 + var2, what semantic meaning did you want to see?

spgarbet commented 5 years ago

I remember now: the behaviour of doing a χ² test in this case was requested to be disabled.

kkmann commented 5 years ago

Thanks for the quick and exhaustive reply! Typo on my side, 1 ~ was what I did. The null makes sense, although it would usually not be reported (weird med journal conventions on tables...).

What's the null for numerical variables then? It seems difficult to define a consistent null across numerical/categorical variables without stratification. Why not go for differences? 1 ~ (stage::Categorical == 1) + (bili <= 0) could be used to specify the null hypotheses (1: stage is always 1, 2: bili compared to point distribution on 0)

ahrakim commented 4 years ago

The null hypothesis with stratification for numeric variables is that the means are equal. Without stratification, it would be like a one-sample t-test (with a one-sided or two-sided hypothesis), and we would have to enter a known mean. If you're suggesting adding a function that tests the hypothesis that bili is less than or equal to 0 for those with Stage 1, to me it makes more sense to conduct a separate test after subsetting the data. I think when these types of descriptive tables are created, people are generally trying to get a better sense of their subjects and do not know information like mean bili levels beforehand.
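To make the one-sample case concrete, here is a rough base R sketch. Everything in it is an assumption for illustration: `x` is simulated placeholder data rather than the real bili column, and the reference mean `mu0` is an arbitrary value that the analyst would have to supply.

```r
# Simulated placeholder data standing in for a numeric column such as bili;
# these are NOT values from the pbc dataset.
set.seed(42)
x <- rlnorm(50, meanlog = 0.5, sdlog = 1)

# Hypothetical reference mean, chosen only for illustration.
mu0 <- 1

# Two-sided one-sample t-test of H0: mean(x) == mu0.
two_sided <- t.test(x, mu = mu0)

# One-sided version: H0: mean <= mu0 against H1: mean > mu0.
one_sided <- t.test(x, mu = mu0, alternative = "greater")

# The statistic is (sample mean - mu0) / standard error, with n - 1 df.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

two_sided$statistic
two_sided$parameter
two_sided$conf.int
two_sided$p.value
```

The need to choose `mu0` up front is exactly the difficulty raised above: a descriptive table usually has no natural reference value to test against.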

Also, displaying results in a clear and efficient manner is another issue to think about. The p-value can be displayed under the test statistic, but the hypothesis being tested would have to be on a new row to avoid confusion. If only one hypothesis is tested, there would be only one p-value under the test statistic column in a table of many rows. I feel that it may look better if presented in its own separate table, along with all the other information that a separate t-test provides, like df, t-statistic, confidence interval, etc.

spgarbet commented 4 years ago

Generally a biomedical paper will have 3 tables: the 1st is demographics (i.e. was randomization good, and what is the population of interest), the 2nd is case/controls (what were the observed outcomes and differences) and the 3rd is model results (how well does a given model perform). The hmisc transform is a really good default for the 1st table, and sometimes for the 2nd (pairing and smd need different treatment). However, there are other examples in the project for the 2nd and 3rd, with pairing, ratios and smd, if you poke around.

spgarbet commented 4 years ago

http://htmlpreview.github.io/?https://raw.githubusercontent.com/spgarbet/tangram-vignettes/master/smd.html