Mixed up response and dependant variables in spearman2()

bogie commented 4 years ago

> getRversion()
[1] ‘3.6.3’

Tangram version: 0.7.1

I have the following dataframe:

'data.frame': 132 obs. of 2 variables: $ Group_DM2_Pre: Factor w/ 3 levels "Non-Diabetic",..: 2 2 2 1 3 2 3 2 NA 1 ... $ Age : num 65 58 75 74 83 61 70 48 59 62 ... ..- attr(*, "label")= chr "Age [years]"

Which yields the following table: Patients %>% tangram(x=Group_DM2_Pre ~ Age, test=TRUE)

              N   Non-Diabetic  Pre-Diabetic   Diabetic        Test Statistic      
                     (N=35)        (N=41)       (N=42)                             
-----------------------------------------------------------------------------------
Age [years]  132   57 *65* 74    59 *65* 76   60 *72* 81  F_{1,116}=4.67, P=0.03^3 
===================================================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon

As you can see it has a degree of freedom of 1 and parses that as a Wilcoxon test, replicating the spearman2() function I notice the following:

> spearman2(Age ~ Group_DM2_Pre,Patients)

Spearman rho^2    Response variable:Age

               rho2    F df1 df2      P Adjusted rho2   n
Group_DM2_Pre 0.046 2.77   2 115 0.0668         0.029 118
> spearman2(Group_DM2_Pre ~ Age,Patients)

Spearman rho^2    Response variable:Group_DM2_Pre

     rho2    F df1 df2      P Adjusted rho2   n
Age 0.039 4.67   1 116 0.0328          0.03 118

it uses the second formula(thus df=1) for approximating the K-W statistic(or in this case wilcoxon), however if I understand the concept behind spearman2 correctly it should use the first formula instead. With three groups in Group_DM2_Pre the df should be df=k-1=3-1=2 right?

Doing the K-W test manually I get the following result:

> kruskal.test(Age ~ Group_DM2_Pre, Patients)

    Kruskal-Wallis rank sum test

data:  Age by Group_DM2_Pre
Kruskal-Wallis chi-squared = 5.3789, df = 2, p-value = 0.06792

Which is what I would expect and is also in line with the spearman2 approximation from the first formula.

Using tangram(Age ~ Group_DM2_Pre, Patients, test=TRUE) I get the expected p-values but obviously a completely different layout:

===============================================================
                N     Age [years]          Test Statistic      
                        (N=132)                                
---------------------------------------------------------------
Group_DM2_Pre                         F_{2,115}=2.77, P=0.07^1 
  Non-Diabetic  49     57 *65* 74                              
  Pre-Diabetic  55  58.7 *65.0* 76.0                           
  Diabetic      56  60.0 *72.5* 81.2                           
===============================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.

Help would be greatly appreciated, since I would really like to use tangram for my doctoral thesis.

Modifying summarize_kruskal_horz and removing the "c()" around datac and datar this way:

  if(inherits(test, "function"))
  {
    stat <- test(row, column, cell_style, ...)
    test <- TRUE
  } else if(length(categories) == 1) stat <- "" else
  # Kruskal-Wallis via F-distribution
  {
    tst  <- suppressWarnings(spearman2(datac, datar, na.action=na.retain))
    stat <- cell_style[['fstat']](
        f         = render_f(tst['F'], "%.2f"),
        df1       = tst['df1'],
        df2       = tst['df2'],
        p         = cell_style[['p']](tst['P'], pformat))
  }

yields the expected results:

> tangram(Group_DM2_Pre ~ Age, Patients, test=TRUE, transforms = publication)
======================================================================
              N   Non-Diabetic  Pre-Diabetic  Diabetic  Test Statistic
                     (N=35)        (N=41)      (N=42)                 
----------------------------------------------------------------------
Age [years]  132     63 (3)        66 (2)      71 (2)     P=0·07^1    
======================================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.

publication() holds a near copy-paste of the lancet style and hmisc functions while solely overwriting summarize_kruskal_horz

is this a known bug or am I misunderstanding the whole process? :D

spgarbet commented 4 years ago

Just at a quick glance, the formulas are backwards for table layout than how they would go into a model function. The problem is that one has a choice between the following having

Rows ~ Cols or Cols ~ Rows

If one choses Rows ~ Cols, then it corresponds to R formula order for Dependent ~ Independent for statistical functions. But then the formula going into the table looks like y1 + y2 + ... yn ~ x.

If one choses Cols ~ Rows, then it is backwards to the normal R formula order, Independent ~ Dependent, but the formula says x ~ y1 + y2 + ... yn. Which seems more natural for specifying a table, however the text in each from is yi ~ x as you noticed.

It's been published for sometime with this choice, it would be a breaking change for users if I were to flip at this point.

bogie commented 4 years ago

So you are saying this is the expected output/formula?

tangram(Age ~ Group_DM2_Pre, Patients, test=TRUE)

===============================================================
                N     Age [years]          Test Statistic      
                        (N=132)                                
---------------------------------------------------------------
Group_DM2_Pre                         F_{2,115}=2.77, P=0.07^1 
  Non-Diabetic  49     57 *65* 74                              
  Pre-Diabetic  55  58.7 *65.0* 76.0                           
  Diabetic      56  60.0 *72.5* 81.2                           
===============================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.

The example suggests that this is not at all what would be the expected formula and format.

All I am saying is that the spearman2 function appears to receive the wrong values and this is exclusive to the summarize_kruskal_HORZ method, as the _vert method does not put c() around datar and datac and delivers the expected result and statistics. removing c() yields the appropriate result.

Please help me understand what the proper approach would be, as to my understanding using a wilcoxon test in a three sample data set would be wrong as you would need to do a kruskal-wallis test, right?

spgarbet commented 4 years ago

I was in hurry before and misread what you were saying. I've confirmed that Hmisc returns the behavior you describe, and the horz is wrong. The commit that changed this was fe7a9c91adf5c3ce9d3dd3af350f29ecf14f57b5. I remember now, there was a very huge file with a lot of data and spearman2 was very slow in performance without this change, but it was because someone passed in labelled integers and this was stripping the label. The problem is now, this is impacting factors. I don't know why labels made it slower, but I'm going to revert this change.

spgarbet commented 4 years ago

@bogie I'm always open to providing more transforms with different statistical goals. If your dissertation leads to some useful transforms, I would love to include them.

spgarbet / tangram

Mixed up response and dependant variables in spearman2() #66