spgarbet / tangram

Table Grammar package for R

Transform: N column, exclude NA values #67

Open bogie opened 4 years ago

bogie commented 4 years ago

So I encountered another problem/style question while using tangram. Currently the N column only removes NA values from the row data (datar), but it would be nice to also exclude entries whose column (grouping) variable is missing.

For example I have the following table:

Patients %>% tangram(x=Group_DM2_Non ~ EndPoint,test=TRUE,transforms = publication, collapse=FALSE)
======================================================================
                        N   Non-Diabetic    Diabetic    Test Statistic
                               (N=76)        (N=42)                   
----------------------------------------------------------------------
Endpoint at close-out  122                                P=0.15^2    
   non-Survivor             19 (27·536%)  16 (41·026%)                
   Survivor                 50 (72·464%)  23 (58·974%)                
======================================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Mann-Whitney

While it is correct that 122 patients had an endpoint at the end of the study, only 108 patients had an endpoint AND were in either the Non-Diabetic or Diabetic group:

Patients %>% filter(!is.na(Group_DM2_Non)) %>% tangram(x=Group_DM2_Non ~ EndPoint,test=TRUE,transforms = publication, collapse=FALSE)
======================================================================
                        N   Non-Diabetic    Diabetic    Test Statistic
                               (N=76)        (N=42)                   
----------------------------------------------------------------------
Endpoint at close-out  108                                P=0.15^2    
   non-Survivor             19 (27·536%)  16 (41·026%)                
   Survivor                 50 (72·464%)  23 (58·974%)                
======================================================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Mann-Whitney

While I could filter for those patients using dplyr, it would probably make sense to include this in the transform itself. However I can't wrap my head around parts of the API, any help would be appreciated!
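The two N values being discussed are just two different counts, which can be sketched without tangram at all. A minimal base R sketch with synthetic data (the column names mirror the post; the values themselves are invented):

```r
# Synthetic stand-in for the Patients data frame in the post;
# the size and value mix are invented for illustration.
set.seed(1)
n <- 133
Patients <- data.frame(
  EndPoint      = sample(c("Survivor", "non-Survivor", NA), n, replace = TRUE),
  Group_DM2_Non = sample(c("Non-Diabetic", "Diabetic", NA), n, replace = TRUE)
)

# What the N column reports today: rows where the row variable is defined
n_row_only <- sum(!is.na(Patients$EndPoint))

# What the issue asks for: rows where the row variable AND the
# grouping (column) variable are both defined
n_both <- sum(!is.na(Patients$EndPoint) & !is.na(Patients$Group_DM2_Non))

n_row_only >= n_both  # always TRUE: requiring both can only shrink the count
```

With the real data the gap between the two counts is exactly the 122 vs 108 difference shown above.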

spgarbet commented 4 years ago

That N value in the hmisc transform shows how many "Endpoints at close-out" were defined. What you're after is the column N value instead. If there were multiple rows of different variables, this would end up being the same value over and over and would not be descriptive of that row's data. One can see this by adding another row of data to the table that has a differing number of NAs present.

library(tangram)

# Add a column with randomly assigned NAs (results will vary run to run)
pbc$surv <- factor(sample(c("Yes", "No", NA), length(pbc$drug), replace=TRUE))

# Count of non-missing values in the new column
sum(!is.na(pbc$surv))

tangram(drug ~ surv + sex, pbc, collapse_single=FALSE)

Which gives the following

> tangram(drug ~ surv + sex, pbc, collapse_single=FALSE)
===============================================================
            N   D-penicillamine     placebo      not randomized
                    (N=154)         (N=158)         (N=106)    
---------------------------------------------------------------
surv       269                                                 
   No            0.495  45/91    0.457   48/105   0.479  35/73 
   Yes           0.505  46/91    0.543   57/105   0.521  38/73 
sex        418                                                 
   male         0.097   15/154   0.133   21/158  0.075    8/106
   female       0.903  139/154   0.867  137/158  0.925   98/106
===============================================================

The recommended way to see the column summary with the hmisc transform is the option overall=TRUE.

> tangram(drug ~ surv + sex, pbc, collapse_single=FALSE, overall=TRUE)
===============================================================================
            N   D-penicillamine     placebo      not randomized     Overall    
                    (N=154)         (N=158)         (N=106)         (N=418)    
-------------------------------------------------------------------------------
surv       269                                                                 
   No            0.495  45/91    0.457   48/105   0.479  35/73   0.476  128/269
   Yes           0.505  46/91    0.543   57/105   0.521  38/73   0.524  141/269
sex        418                                                                 
   male         0.097   15/154   0.133   21/158  0.075    8/106  0.105   44/418
   female       0.903  139/154   0.867  137/158  0.925   98/106  0.895  374/418
===============================================================================

If one wants to change the default summary: this is a Categorical X Categorical variable, dispatched via the "hmisc" transform at lines 446-448 near the end of transform-hmisc.R, which results in the function summarize_chisq being called. The line at 314

  table <- add_col(table, cell_style[['n']](sum(!is.na(row$data)), possible=length(row$data), ...))

is the one you'd want to change. To override this transform, create your own transform in the current environment by copying the existing one, then define whatever you'd like.

hmisc <- tangram::hmisc
hmisc[['Categorical']][['Categorical']] <- your_function_here

Then the new behavior would be used for every call.

However, the original 'N' values describe the data in a consistent manner that handles multiple rows. The dplyr approach is probably the simpler one, because now you're describing data that has already been processed to drop observations with missing column variables.

spgarbet commented 4 years ago

Oh, the row entries are generated using table and the useNA option is passed through.

> tangram(drug ~ surv + sex, pbc, collapse_single=FALSE, overall=TRUE, useNA="always")
====================================================================================
                 N   D-penicillamine     placebo      not randomized     Overall    
                         (N=154)         (N=158)         (N=106)         (N=418)    
------------------------------------------------------------------------------------
surv            269                                                                 
   No                0.292   45/154   0.304   48/158  0.330   35/106  0.306  128/418
   Yes               0.299   46/154   0.361   57/158  0.358   38/106  0.337  141/418
   Missing (%)       0.409   63/154   0.335   53/158  0.311   33/106  0.356  149/418
sex             418                                                                 
   male              0.097   15/154   0.133   21/158  0.075    8/106  0.105   44/418
   female            0.903  139/154   0.867  137/158  0.925   98/106  0.895  374/418
   Missing (%)       0.000    0/154   0.000    0/158  0.000    0/106  0.000    0/418
====================================================================================
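The pass-through mirrors what base R's table does on its own. A toy sketch, independent of tangram:

```r
# Two small factors; one contains a missing value.
x <- factor(c("Yes", "No", NA, "Yes"))
g <- factor(c("A", "A", "B", "B"))

table(x, g)                    # the NA in x is silently dropped
table(x, g, useNA = "always")  # adds an <NA> row and column, as above
```

The second call is what produces the "Missing (%)" rows in the tangram output.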
bogie commented 4 years ago

I think I understand what you are saying, but this is not what I was trying to convey.

In my case I have a data frame with 133 observations (patients). 122 of those have an Endpoint at close-out (sum(!is.na(datar))), however since I am grouping by Group_DM2_Non, I only have 108 observations across both columns of my table.

As such table <- add_col(table, cell_style[['n']](sum(!is.na(row$data)), possible=length(row$data), ...)) needs an option to only count observations that satisfy both: !is.na(Group_DM2_Non) & !is.na(EndPoint)

You are expecting that the grouping value (e.g. drug) has no NA values, so you only exclude NA values from row$data and don't check something like sum(!is.na(row[!is.na(category)]$data))

In my opinion the row N should never be larger than the sum of the columns. I don't think an Overall column helps, since it would still show Overall (N=118) while some rows have N>118!

Example:

Patients %>%
    tangram(x=Group_DM2_Non + Group_DM2_Pre_Non ~ Age + Sex + Group + Outcome + EndPoint + ECMO + Admission_Type + HbA1C_DM2 + DaysTo_Admission + DaysTo_ICU + DaysTo_Intubation + FeverDays + OxygenDays + ICUDays + VentDays + ECMODays + HospDays + PronePos + CVVH  + Antibiotics,
            data=., id="Group_DM2_characteristics", include_p =TRUE, style="nejm", collapse=TRUE, capture_units=FALSE, transforms = publication, digits = 1, test=TRUE, caption = "Patient characteristics", overall=TRUE)
Patient characteristics
===========================================================================================================================================================
                                   N   Non-Diabetic   Diabetic     Overall    Test Statistic   N   Pre-Diabetic  Normoglycaemic   Overall    Test Statistic
                                          (N=75)       (N=43)      (N=118)                            (N=41)         (N=34)        (N=75)                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Age (years)                       133    64.3±1.6     70.7±1.7    65.9±1.2      P=0.02^3      133    65.6±1.8       62.7±2.8      65.9±1.2     P=0.82^3    
Sex : w                           133   26 (34·7%)   12 (27·9%)   38 (32·2%)    P=0.45^2      133   16 (39·0%)     10 (29·4%)    26 (34·7%)    P=0.38^2    
Group : ARDS                      133   40 (53·3%)   20 (46·5%)   60 (50·8%)    P=0.48^2      133   25 (61·0%)     15 (44·1%)    40 (53·3%)    P=0.15^2    
Status at close-out               133                                           P=0.27^2      133                                              P=0.06^2    
   non-Survivor                         19 (25·3%)   17 (39·5%)   36 (30·5%)                         9 (22·0%)     10 (29·4%)    19 (25·3%)                
   Survivor                             50 (66·7%)   23 (53·5%)   73 (61·9%)                        26 (63·4%)     24 (70·6%)    50 (66·7%)                
   Hospitalised                          6 (8·0%)     3 (7·0%)     9 (7·6%)                          6 (14·6%)      0 (0·0%)      6 (8·0%)                 
Endpoint at close-out : Survivor  124   50 (72·5%)   23 (57·5%)   73 (67·0%)    P=0.11^2      124   26 (74·3%)     24 (70·6%)    50 (72·5%)    P=0.73^2  
spgarbet commented 4 years ago

I understand what you are saying quite well. The problem is there has to be a consistent application of a principle. The issue is not as easy as it first appears.

Right now the N value of the row is "number of defined row values". This has been the standard report from summaryM and this package for quite some time. Very straightforward.

What is requested is that the N value of the row only count observations for which the column variable is also defined. I can add new behavior by defining an interface that adds an option of some form. The question is design.

If I add a pass-through of the exclude argument to table, it applies to both columns and rows. One could use it to exclude factor levels as well. However, this makes the column N depend on the row, which becomes inconsistent from row to row, so this approach won't work and should probably not be exposed to a user. I've been asked for exclude before, but it could lead to some surprising behavior if not very carefully implemented. I am leery of this without some careful checks, specifically for the case when NA is specified as the exclude value, which is what you're asking about. It would work for dropping factor levels, but the potential exists for an inconsistent viewpoint if there were overlap in the names of factor levels across the tables.

Maybe allow it and add something like exclude="colNA" as a special possible option?
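The consistency concern is visible directly in base R, where table's exclude argument strips the level from every dimension at once (toy vectors, unrelated to the data above):

```r
# "c" appears as a level in both factors.
x <- factor(c("a", "b", "a", "c"))
y <- factor(c("c", "d", "d", "c"))

table(x, y)                 # full 3 x 2 cross-tabulation
table(x, y, exclude = "c")  # level "c" is dropped from BOTH x and y
```

Any observation whose value in either dimension is "c" vanishes from the excluded table, which is exactly the row/column entanglement described above.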

bogie commented 4 years ago

I don't see the problem in excluding column NA values from the row N. In my mind it is the expected behaviour, simply because with Categorical x Categorical and collapse=FALSE the number of observations in each column for each factor level should add up to the row N.

I fixed it for myself by filtering with dplyr, but I would suggest that having this as an option would be nice! Thanks for your help so far!
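The additivity described above can be checked in base R on a toy data frame (the data and names here are illustrative only):

```r
# Toy data: one row with a missing row value, one with a missing group.
df <- data.frame(
  EndPoint = c("S", "S", "nS", NA, "S", "nS", "S"),
  Group    = c("A", "B", "A",  "A", NA,  "B", "B")
)

# Row N with rows dropped when the grouping value is missing
row_N <- sum(!is.na(df$EndPoint) & !is.na(df$Group))

# Per-column counts of defined row values (NA groups fall out of tapply)
col_counts <- tapply(!is.na(df$EndPoint), df$Group, sum)

sum(col_counts) == row_N  # TRUE: the columns now add up to the row N
```

With the current behaviour (counting every non-missing EndPoint) the row N would be 6 here, one more than the two columns combined.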

spgarbet commented 4 years ago

If not done carefully it can lead to variable N values for the columns per row. That is the danger, and it cannot be allowed via the interface; it must be consistent through all cases, no matter how obscure. Right now it is consistent because it follows one very simple rule. Relaxing and opening this interface up for different options is okay, but I can't consider just your use case; I have to consider all cases, and I should write tests for them as well.

I think I've got something that will work: two arguments, row.exclude and col.exclude, so they are applied separately. Next there has to be a means to flag values from the other variable. A character string is not ideal. A logical vector could be interpreted as excluding those values, and a single logical as "exclude observations with a missing column value". One could specify crazy things like,

row.exclude=list(c(NA, NaN, 3), "absent", is.na(someexternalvariablenotindataframe), TRUE)

This would exclude all row NA values, NaN, any numeric 3, any factor "absent", a mask derived from something external that was not provided, and all missing column values. Similar logic can be applied to the column. Since row and column are evaluated independently, no inconsistency can arise. The interface is incredibly versatile and covers some other requests I've had in a robust manner.

So the question is does row.exclude=TRUE seem a reasonable interface to mean that rows are excluded when the column isn't defined?
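As a rough sketch only: nothing like this exists in tangram yet, and keep_rows and its argument handling are entirely hypothetical, but the proposed semantics for one variable might be interpreted along these lines:

```r
# Hypothetical interpreter for the proposed exclusion spec (not tangram code):
# - a single TRUE means "drop observations where the OTHER variable is NA"
# - a logical vector of full length is an explicit exclusion mask
# - anything else is a set of values to drop (NA included, via %in%)
keep_rows <- function(row_var, col_var, exclude = NULL) {
  keep <- rep(TRUE, length(row_var))
  for (spec in if (is.list(exclude)) exclude else list(exclude)) {
    if (is.null(spec)) next
    if (isTRUE(spec)) {
      keep <- keep & !is.na(col_var)            # exclude missing column values
    } else if (is.logical(spec) && length(spec) == length(row_var)) {
      keep <- keep & !spec                      # external logical mask
    } else {
      keep <- keep & !(row_var %in% spec)       # drop listed values
    }
  }
  keep
}

rv <- c(1, NA, 3, 4)
cv <- c("A", "B", NA, "B")
sum(keep_rows(rv, cv, exclude = list(NA, TRUE)))  # 2 observations survive
```

Because the function only ever looks at one variable's values (plus the missingness of the other), evaluating rows and columns through it independently cannot produce the per-row column N drift described above.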

spgarbet commented 4 years ago

I've tried a variety of solutions to this, but they've all led to inconsistencies. I was ready to throw in the towel and give up on it when I hit upon a method that just might work, so I've not given up yet.

bogie commented 4 years ago

Your commitment to this is remarkable, thanks! I had to rotate to an outpatient clinic for the last few weeks, so I am just now catching up on things.

In the past I found it hard to find documentation on optional function arguments for tangram; if you include something like row.exclude, it has to be documented thoroughly.

I would still argue that this should not be opt-in but rather the default, with an option to fall back to the current behaviour. In my mind the row N of non-NA values should never be larger than the combined column N values. I know my colleagues were confused at first as well. The proposed behaviour would let people recognize that some data sets are incomplete.

In the end it is up to you, for now I have fixed it by filtering with dplyr beforehand.

While you are at it, is there a simple interface to access table cell values by column/row names in order to reference them in the full text?

spgarbet commented 4 years ago

You're the third or fourth person who's asked me for it. So I figure it's a broader user need.

For documentation in the current release try ?tangram or ?hmisc.

I have some code written, I'm testing. It can lead to surprising behavior, so I'm working on allowing targeted exclusion criteria.

dplyr is still probably the best.

Right now table cell values are just numbered; it's a list of lists. The question becomes what the column/row names would be. A single variable can generate multiple rows/columns, so the variable name alone is insufficient. At the inception of the package there was a strict naming convention intended for result tracing, tying a number back to the originating data set. This rarely got used, so it's fallen into disrepair, but it would be possible to revive it for this.

spgarbet commented 4 years ago

This is still broken for data coming from the environment. I don't know how I'm going to handle that; it's more involved, but it will have to be supported for this enhancement to be complete.