ucl-ihi / CodeClub

Repository for all things IHI CodeClub
https://ucl-ihi.github.io/CodeClub/

apply in R #9

Closed jasminegr closed 5 years ago

jasminegr commented 5 years ago

Hi,

I had this issue last week.

I wanted to get the counts for events (1) and no-events (0) from a dataset. In the first data table, the apply function returns a list (not what I want), whereas the second data table gives us the counts of events and no-events for each ICD code (what I want).

Data table 1
eid ICD1    ICD2    ICD3
1   0   0   1
2   1   0   1
3   1   1   1
4   0   1   1
5   0   1   1
6   1   0   1
7   1   1   1
8   0   0   1
9   0   0   1
10  0   1   1

Data table 2
eid ICD1    ICD2    ICD3
1   0   0   1
2   1   0   0
3   1   1   1
4   0   1   0
5   0   1   0
6   1   0   0
7   1   1   1
8   0   0   0
9   0   0   0
10  0   1   1

# Code club: apply

rm(list = ls())
if (!require(data.table)) {install.packages("data.table", dependencies = TRUE); library(data.table)}

# Data table 1
dt <- read.delim("apply_dt.txt")
events <- apply(dt[,2:dim(dt)[2]] , 2, table) 
events
# returns a list

# Data table 2
dt2 <- read.delim("apply_dt2.txt")
events2 <- apply(dt2[,2:dim(dt2)[2]] , 2, table)
events2
# returns counts

Does anyone know why?

drhodesbrc commented 5 years ago

what does str() return for each of the dfs?

Also, you could just try:

library(tidyverse)
dt %>% select(starts_with('ICD')) %>% gather() %>% count(value)

alhenry commented 5 years ago

I think this has something to do with the default behaviour of apply and table.

Consider the following example:

dt2 <- dt <- data.frame(
  eid = 1:10,
  ICD1 = c(0,1,1,0,0,1,1,0,0,0),
  ICD2 = c(0,0,1,1,1,0,1,0,0,1),
  ICD3 = rep(1,10)
)

dt2$ICD3 <- c(1,0,1,0,0,0,1,0,0,1)

table(dt$ICD3)
#> 
#>  1 
#> 10
str(table(dt$ICD3))
#>  'table' int [1(1d)] 10
#>  - attr(*, "dimnames")=List of 1
#>   ..$ : chr "1"

table(dt2$ICD3)
#> 
#> 0 1 
#> 6 4
str(table(dt2$ICD3))
#>  'table' int [1:2(1d)] 6 4
#>  - attr(*, "dimnames")=List of 1
#>   ..$ : chr [1:2] "0" "1"

As we can see, table(dt$ICD3) returned a table of length 1 because every observation has the value 1, whereas table(dt2$ICD3) (like the other ICD columns) returned a table of length 2 because the column contains at least one 0 and one 1. Both tables are one-dimensional; only their lengths differ.

I think apply returned a list for dt because the per-column outputs had different lengths (2, 2 and 1), so they could not be combined into a matrix, whereas for dt2 every output has length 2 and apply can merge them neatly into a 2-row matrix.
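To make that concrete, here is a minimal sketch that rebuilds the two example data frames and inspects what apply hands back. The names res1 and res2 are only illustrative, and the data are the toy values from this thread, not the original files.

```r
# Toy data from this thread: ICD3 is all 1s in dt, mixed 0/1 in dt2
dt <- data.frame(
  eid  = 1:10,
  ICD1 = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 0),
  ICD2 = c(0, 0, 1, 1, 1, 0, 1, 0, 0, 1),
  ICD3 = rep(1, 10)
)
dt2 <- dt
dt2$ICD3 <- c(1, 0, 1, 0, 0, 0, 1, 0, 0, 1)

res1 <- apply(dt[, -1], 2, table)   # per-column tables of unequal length
res2 <- apply(dt2[, -1], 2, table)  # per-column tables, all of length 2

lengths(res1)    # number of distinct values found in each column of dt
is.list(res1)    # apply could not simplify, so it kept a list
is.matrix(res2)  # equal lengths, so apply simplified to a matrix
```

So the difference comes from the data rather than from the code: a single column with only one distinct value is enough to stop apply from simplifying.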

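One workaround for this is to first convert each ICD code column to a factor with both levels (0 and 1), so that table() always reports a count for each level, even when it is zero: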

apply(dt[,2:dim(dt)[2]], 2, function(x) table(factor(x, levels = c(0,1)))) 
#>   ICD1 ICD2 ICD3
#> 0    6    5    0
#> 1    4    5   10

apply(dt2[,2:dim(dt2)[2]], 2, function(x) table(factor(x, levels = c(0,1)))) 
#>   ICD1 ICD2 ICD3
#> 0    6    5    6
#> 1    4    5    4
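The reason the factor trick works can be seen on a single vector. This is just a small sketch with a made-up vector x (not from the original data): table() only reports values it actually sees, unless the input is a factor whose levels say which values are possible.

```r
x <- rep(1, 10)  # hypothetical column where every observation is an event

t_plain  <- table(x)                              # only the observed value appears
t_factor <- table(factor(x, levels = c(0, 1)))    # both levels appear, 0 gets a zero count

length(t_plain)
length(t_factor)
t_factor
```

Because every column now yields a table of length 2, apply can always simplify the result to a matrix.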

Or the tidyverse solution as suggested by Dan:

library(tidyverse)
count_event <- function(dt){
  dt %>% select(starts_with('ICD')) %>% gather %>% count(key, value) %>% spread(key, n, fill = 0)
}
count_event(dt)
#> # A tibble: 2 x 4
#>   value  ICD1  ICD2  ICD3
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     0     6     5     0
#> 2     1     4     5    10
count_event(dt2)
#> # A tibble: 2 x 4
#>   value  ICD1  ICD2  ICD3
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     0     6     5     6
#> 2     1     4     5     4

Hope this helps

hilge commented 5 years ago

Another solution - only using base R

events <- as.data.frame(apply(dt[, 2:dim(dt)[2]], 2, function(X) c(sum(X == 0), sum(X == 1))))

anoopshah commented 5 years ago

One question is what format you want the result in for different potential inputs. Do you just want to add up the 0s and 1s?

sapply(dt[2:4], function(x) c(sum(x == 0), sum(x == 1)))
#>      ICD1 ICD2 ICD3
#> [1,]    6    5    0
#> [2,]    4    5   10
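If a fixed output shape matters, one possible variant of the same idea uses vapply, which checks that every column returns exactly two integers and lets the rows be labelled. This is only a sketch: count_events_base and the row labels no_event / event are made-up names, and it assumes the columns of interest start with "ICD" and contain only 0s and 1s.

```r
# Toy data from this thread (ICD3 is all 1s)
dt <- data.frame(
  eid  = 1:10,
  ICD1 = c(0, 1, 1, 0, 0, 1, 1, 0, 0, 0),
  ICD2 = c(0, 0, 1, 1, 1, 0, 1, 0, 0, 1),
  ICD3 = rep(1, 10)
)

# Hypothetical helper: counts 0s and 1s in every column whose name starts with "ICD"
count_events_base <- function(df, cols = grep("^ICD", names(df), value = TRUE)) {
  vapply(
    df[cols],
    function(x) c(sum(x == 0), sum(x == 1)),
    FUN.VALUE = c(no_event = 0L, event = 0L)  # template: two named integers per column
  )
}

res <- count_events_base(dt)
res
```

Unlike apply with table, this always returns a 2-row matrix, whatever values happen to be present in each column.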