spgarbet / tangram

Table Grammar package for R

Binary variable inserts extra row #44

Closed kylerove closed 4 years ago

kylerove commented 5 years ago

Sorry, another bug.

In this example I showed previously, the output for one of the binary variables (postop_mobilization_check) contains an extra row:

table3 <- tangram(group ~ eras_score[0]

[Screenshot: 2018-09-14 at 10:53:30 AM]

This seems to occur randomly and only occasionally. There is nothing about that variable that is different as far as I can see (type, factor, contents, etc).

spgarbet commented 5 years ago

Now that is a weird one. Can you send me some information on that variable? Specifically interested in

unique(allSingle$postop_mobilization_check)
class(allSingle$postop_mobilization_check)
typeof(allSingle$postop_mobilization_check)

Also all the [0] you're adding can be handled by adding digits=0 as an option.

kylerove commented 5 years ago

> unique(allSingle$postop_mobilization_check)
[1] 1 0
Levels: 0 1
> class(allSingle$postop_mobilization_check)
[1] "factor"
> typeof(allSingle$postop_mobilization_check)
[1] "integer"

spgarbet commented 5 years ago

I think the problem is that postop_excessdrainremoval_check doesn't have a label.

kylerove commented 5 years ago

Ah, I see, that row actually corresponds to that variable. It does have a label. The problem is perhaps that all the values are 1. Perhaps there is some bug preventing the label from coming over for that reason?

> unique(allSingle$postop_excessdrainremoval_check)
[1] 1
Levels: 1
> class(allSingle$postop_excessdrainremoval_check)
[1] "factor"
> typeof(allSingle$postop_excessdrainremoval_check)
[1] "integer"
> attributes(allSingle$postop_excessdrainremoval_check)
$levels
[1] "1"

$class
[1] "factor"

$label
[1] "Early removal of excess drains"

spgarbet commented 5 years ago
allSingle$postop_excessdrainremoval_check <- factor(
  allSingle$postop_excessdrainremoval_check,
  levels=c(0,1))

Will force it to have two levels. I don't have any code for dealing with a single level. I can add that. But I don't know what it should say really. I guess just having it get the label right would be enough.
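A minimal sketch of the re-leveling step above, using base R only. The vector `check` is made-up data standing in for `postop_excessdrainremoval_check`; it is not taken from the thread. One thing to watch: `factor()` builds a new vector, so a custom `label` attribute has to be re-attached afterwards.

```r
# Made-up stand-in for a binary check variable where every value is 1.
check <- factor(c(1, 1, 1, 1))
attr(check, "label") <- "Early removal of excess drains"
levels(check)   # only one level is present

# Re-create the factor with both levels declared explicitly.
fixed <- factor(check, levels = c(0, 1))

# factor() returns a fresh vector without the custom attribute,
# so copy the variable label across from the original.
attr(fixed, "label") <- attr(check, "label")

levels(fixed)            # both levels are now declared
counts <- table(fixed)   # the unobserved level 0 is kept with a count of 0
```

With both levels declared, the empty level shows up in tabulations with a count of zero instead of being absent.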

kylerove commented 5 years ago

For factors, can you confirm that it uses the level rather than labels for individual factor levels? For example, I have a variable:

> attributes(recordsData$diagnosis_1)
$levels
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "99"

$class
[1] "factor"

$labels
 [1] "BK cystitis"                       "cerebral palsy"                    "cloacal anomaly"                   "cloacal exstrophy"                
 [5] "epispadias (primary)"              "extrophy-epispadias complex"       "imperforate anus"                  "non-neurogenic neurogenic bladder"
 [9] "open bladder neck"                 "persistent UG sinus"               "posterior urethral valves"         "rhabdomyosarcoma"                 
[13] "sacral agenesis"                   "sacrococcygeal teratoma"           "spina bifida"                      "spinal cord injury"               
[17] "spinal cord tumor"                 "spinal muscular atrophy"           "tethered cord"                     "UG sinus"                         
[21] "VACTERL"                           "other"                             "unknown"                          

$label
[1] "Diagnosis"

It outputs the level (1-22, 99) in the table rather than the factor level label. Previously, I let the REDCap output set the labels as the levels, but that was harder to manipulate than plain integers. Can tangram put the level labels into the table rather than the levels themselves? Is this something I can specify with a custom cell function?

spgarbet commented 5 years ago

It should be using the labels. I assume you're using the default transform. Is this correct?

kylerove commented 5 years ago

Correct.

Here is sample data for a simpler factor variable with 3 levels:

> recordsData$sex
  [1] 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0
 [84] 0 0 1 0 1 1 0 1 0 0 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0
[167] 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1
attr(,"label")
[1] Sex
attr(,"labels")
[1] Female    Male      DSD/Other
Levels: 0 1 2
>
> attributes(recordsData$sex)
$levels
[1] "0" "1" "2"

$class
[1] "factor"

$label
[1] "Sex"

$labels
[1] "Female"    "Male"      "DSD/Other"
>
> testTable <- tangram(group ~ sex, data=recordsData)
Warning message:
In tangram.formula(group ~ sex, data = recordsData) :
  tangram() will require unique id to be specified in the future
>
> testTable
=======================================
       N        1              3       
              (N=89)        (N=101)    
---------------------------------------
Sex   190                              
   0       0.539  48/89  0.525   53/101
   1       0.461  41/89  0.475   48/101
   2       0.000   0/89  0.000    0/101
=======================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.
spgarbet commented 5 years ago

I see the issue. It's a factor, but it's using labels via Hmisc and not the internal R labels for a factor.

spgarbet commented 5 years ago

I can't figure out how you constructed sex. What commands did you use?

> x <- data.frame(sex   = factor(sample(0:2, 100, TRUE), labels=c("Female", "Male", "DSD/Other")),
+                 group = factor(sample(c(1,3), 100, TRUE)))
> tangram(group ~ sex, x)
=============================================
               N        1             3      
                      (N=45)        (N=55)   
---------------------------------------------
sex           100                            
   Female          0.244  11/45  0.273  15/55
   Male            0.400  18/45  0.309  17/55
   DSD/Other       0.356  16/45  0.418  23/55
=============================================
spgarbet commented 5 years ago
> y <- factor(sample(0:2, 100, TRUE), labels=c("Female", "Male", "DSD/Other"))
> y
  [1] Female    Male      Female    DSD/Other Male      DSD/Other DSD/Other Male      Female   
 [10] Female    DSD/Other Female    Male      Female    Female    Female    DSD/Other Male     
 [19] Female    DSD/Other DSD/Other Male      DSD/Other Male      Female    Female    Male     
 [28] DSD/Other Male      Male      DSD/Other DSD/Other Male      Male      DSD/Other Female   
 [37] DSD/Other DSD/Other DSD/Other DSD/Other Male      Female    Male      DSD/Other DSD/Other
 [46] DSD/Other Male      Female    DSD/Other Male      Female    Male      DSD/Other Female   
 [55] Male      DSD/Other DSD/Other Female    DSD/Other Female    Female    Female    DSD/Other
 [64] DSD/Other Female    Female    Female    DSD/Other DSD/Other Female    DSD/Other Male     
 [73] Female    Male      Female    DSD/Other Female    Male      DSD/Other Male      Male     
 [82] DSD/Other Female    DSD/Other Female    Male      DSD/Other Male      Female    DSD/Other
 [91] Male      Male      Male      Female    Male      Male      DSD/Other Female    Male     
[100] Female   
Levels: Female Male DSD/Other
> attributes(y)
$levels
[1] "Female"    "Male"      "DSD/Other"

$class
[1] "factor"
kylerove commented 5 years ago

I'm pulling in the dataset directly from REDCap using https://github.com/nutterb/redcapAPI.

When it pulls the data in, this API has an option to set factors, but I was having issues with the code. So when it pulls it in, I just get the raw data (recordsData$sex is something in 0:2). After that I run:

recordsData$sex = factor(recordsData$sex, levels=attributes(recordsData$sex)$redcapLevels)

Rather than run as_factor(), this ensures all levels available in our REDCap database are set. This avoids the situation where an empty level would otherwise be omitted. Then I run:

attr(recordsData$sex, "label") = "Sex"

This sets the high-level label for the whole covariate.

attributes(recordsData$sex)$labels = str_remove(attributes(recordsData$sex)$redcapLabels,"^[0-9]*. ")

Then, I pull in the labels for the levels. Our REDCap databases usually prefix each factor label with the corresponding raw data value like 1="1. The label for #1". Rather than manipulate the data with the full label, it is easier to keep the raw values (integers) and use those.
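The construction described above can be reproduced without a REDCap connection. This is a sketch under assumptions: the raw values and the `redcapLevels`/`redcapLabels` attributes are set by hand with made-up contents, and base `sub()` stands in for `str_remove()` (with the dot escaped so it matches a literal period).

```r
# Hand-made stand-in for a raw REDCap column (values and attributes are assumed).
raw <- c(0, 1, 1, 0, 1)
attr(raw, "redcapLevels") <- 0:2
attr(raw, "redcapLabels") <- c("0. Female", "1. Male", "2. DSD/Other")

# Step 1: build the factor on the raw integer values, declaring every level.
sex <- factor(raw, levels = attr(raw, "redcapLevels"))

# Step 2: variable-level label.
attr(sex, "label") <- "Sex"

# Step 3: store the per-level labels in a separate "labels" attribute,
# stripping the numeric prefix.
attr(sex, "labels") <- sub("^[0-9]+\\. ", "", attr(raw, "redcapLabels"))

levels(sex)          # still the raw values
attr(sex, "labels")  # the descriptive labels, stored alongside
names(table(sex))    # tabulation is keyed by the levels, not by "labels"
```

The `labels` attribute travels with the vector, but base R functions such as `table()` only look at `levels()`, which is consistent with the integers appearing in the table output.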

spgarbet commented 5 years ago

Okay. I'll look through the REDCap code and see if I can understand just what it's doing.

kylerove commented 5 years ago

To be honest, the issue I was having is that having the redcapAPI set factors like:

recordsData <- exportRecords(rcon, factors=TRUE, labels=FALSE, dates=FALSE, survey=FALSE, dag=TRUE, batch.size=-1)

was doing what you created. It creates a factor where all the levels = labels. I have been preferring to do my analysis without the labels and just the raw integer values. So I call the above with factors=FALSE set.

spgarbet commented 5 years ago

For the last line of your code try "levels":

> x <- factor(sample(1:2, 10, TRUE), levels=1:2)
> x
 [1] 2 2 1 1 1 1 2 2 1 2
Levels: 1 2
> levels(x) <- c("On", "Off")
> x
 [1] Off Off On  On  On  On  Off Off On  Off
Levels: On Off
kylerove commented 5 years ago

Yeah, that overwrites the integer values, which are easier to manipulate and compare than some of our really long labels coming from the database.

kylerove commented 5 years ago

Maybe I'm making it overly complicated for my purposes. :) That is very possible.

spgarbet commented 5 years ago

I can't find how any other packages use "labels" on a factor. Setting it directly seems to not construct a proper factor. However, that doesn't preclude using it in the package. It would just be a little off the main path.

kylerove commented 5 years ago

This is a little surprising if you think about it. We can set labels for variables and don’t force people to use raw variable names in the output. Why not have the same abstraction for factor levels? Descriptive labels are sometimes verbose and gum up the code if I’m constantly comparing long strings to filter/mutate the data.

Kyle

spgarbet commented 5 years ago

I've never pushed against this aspect of factors in R. It is somewhat disappointing how this works in base R, since the type information is available.

However, this works:

> x <- factor(0:2, labels=c("One", "Two", "Three"))
> x
[1] One   Two   Three
Levels: One Two Three
> x[1] == "One"
[1] TRUE
> 1 == x[1]
[1] FALSE
> x[1] == 1
[1] FALSE
> as.numeric(x[1]) == 1
[1] TRUE
> 
> `==` <- function(e1, e2)
+ {
+   ifelse(is.factor(e1) & is.numeric(e2),
+          base::`==`(as.numeric(e1), e2),
+          ifelse(is.factor(e2) & is.numeric(e1),
+                 base::`==`(as.numeric(e2), e1),
+                 base::`==`(e1, e2))
+         )
+ }
> 
> x[1] == "One"
[1] TRUE
> 1 == x[1]
[1] TRUE
> x[1] == 1
[1] TRUE
> as.numeric(x[1]) == 1
[1] TRUE
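A sketch of an alternative that does not mask `==`, assuming the goal is to compare against the original integer values. Note that `as.numeric()` on a factor returns the internal 1-based codes, not the values the factor was built from, so a factor with levels 0, 1, 2 has to be converted through its levels first. The helper name `factor_values` is hypothetical and the data are made up.

```r
# Hypothetical helper: recover the numeric values encoded in the level names.
factor_values <- function(f) as.numeric(levels(f))[f]

sex <- factor(c(0, 2, 1, 0), levels = 0:2)

codes <- as.numeric(sex)     # internal codes: 1 3 2 1
vals  <- factor_values(sex)  # original values: 0 2 1 0

# Comparisons against the raw integers then work without redefining `==`.
is_male <- factor_values(sex) == 1
```

This keeps base `==` untouched, at the cost of calling the helper wherever a numeric comparison is needed.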
kylerove commented 5 years ago
allSingle$postop_excessdrainremoval_check <- factor(
  allSingle$postop_excessdrainremoval_check,
  levels=c(0,1))

Will force it to have two levels. I don't have any code for dealing with a single level. I can add that. But I don't know what it should say really. I guess just having it get the label right would be enough.

This bug is still present. I create a fair number of descriptive tables and just want it to say 0% with the correct label. It still pulls in the opposite value (0 = 100% when it should match other reported two-level variables and give % of the second level, aka 1=0%).

I worked around the labels for categorical by performing all logic/data grunging up front and then re-leveling the variables to get the correct labels.
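The workaround might look roughly like the following sketch: keep the integer-coded factor for filtering and comparisons, and build a display copy whose levels are the descriptive labels just before tabulating. The helper name `relabel_for_display` and the sample data are hypothetical; only the `label` and `labels` attribute names come from the thread.

```r
# Hypothetical helper: turn a factor carrying a "labels" attribute into a
# standard factor whose levels are those labels, keeping the variable label.
relabel_for_display <- function(f) {
  out <- factor(as.character(f), levels = levels(f), labels = attr(f, "labels"))
  attr(out, "label") <- attr(f, "label")
  out
}

# Made-up data in the same shape as recordsData$sex.
sex <- factor(c(0, 1, 1, 0), levels = 0:2)
attr(sex, "label")  <- "Sex"
attr(sex, "labels") <- c("Female", "Male", "DSD/Other")

# Do all filtering and logic on `sex`, then convert only for output.
display_sex <- relabel_for_display(sex)
levels(display_sex)
```

The original `sex` vector is left unchanged, so comparisons against the raw values keep working after the display copy is made.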

spgarbet commented 5 years ago

I'll look into it again.

spgarbet commented 5 years ago
library(tangram)

test <- data.frame(
  v = factor(rep('v', 100), levels=c('u','v')),
  x = rnorm(100)
)

tangram(v ~ x, test)

Produces

===================================
    N     u             v          
        (N=0)        (N=100)       
-----------------------------------
x  100         -0.685 *0.087* 0.699
===================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.

and

tangram(v ~ x, test, collapse_single=TRUE)

produces

================================
        N            x          
                  (N=100)       
--------------------------------
v : v  100  -0.685 *0.087* 0.699
================================
N is the number of non-missing value. ^1 Kruskal-Wallis. ^2 Pearson. ^3 Wilcoxon.

What is it that you expect? Or what is the comparison being done?

kylerove commented 4 years ago

Sorry for the long delay. It is just a comparison of the incidence of a particular binary variable:

==================================================================================================
                                                   1                               3              
                                              26 patients                     13 patients         
--------------------------------------------------------------------------------------------------
Dexmedetomidine : 1                              2 (8%)                        11 (85%)
Received opioids in PACU : 1                    14 (54%)                        2 (15%)           
   0                                           26 (100%)                       13 (100%)  
==================================================================================================

For whatever reason when the variable is all 1 or all 0, it prints the level (0 or 1) instead of label : level

In the example above, I would expect the row to appear like this:

==================================================================================================
                                                   1                               3              
                                              26 patients                     13 patients         
--------------------------------------------------------------------------------------------------
Dexmedetomidine : 1                              2 (8%)                        11 (85%)
Received opioids in PACU : 1                    14 (54%)                        2 (15%)           
Received ketamine : 1                            0 (0%)                         0 (0%)  
==================================================================================================
kylerove commented 4 years ago

I think you can close this. It is functioning as intended. That variable did not have two specified levels. Once I corrected that, it shows fine.
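Since the cause turned out to be a factor without two declared levels, a quick check before building a table can catch this early. The following is a sketch in base R; the helper name `single_level_factors` and the example data frame are hypothetical.

```r
# Hypothetical helper: names of factor columns that declare fewer than two levels.
single_level_factors <- function(df) {
  is_single <- vapply(
    df,
    function(col) is.factor(col) && nlevels(col) < 2,
    logical(1)
  )
  names(df)[is_single]
}

# Made-up example: `ketamine` is all 0, so factor() infers a single level.
df <- data.frame(
  ketamine = factor(c(0, 0, 0)),
  opioids  = factor(c(1, 0, 1)),
  age      = c(4, 7, 9)
)

flagged <- single_level_factors(df)
flagged
```

Any column reported here can be rebuilt with `factor(x, levels = c(0, 1))` before it is passed to the table function.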