tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

unnest multiple columns at once #44

Closed momeara closed 9 years ago

momeara commented 9 years ago

I'd like unnest to support unnesting multiple columns at once. For example,

x <- data_frame(
  a=c("a:b", "c"), b=c("1:2", "3"), c=c(11,22)) %>%
  transform(
    a = strsplit(a,":"),
    b = strsplit(b,":")) %>%
  unnest(a, b)

would produce

  a b  c
1 a 1 11
2 b 2 11
3 c 3 22
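
For reference, the requested behaviour can be sketched in base R. This is only an illustration under stated assumptions: `unnest_parallel` is a hypothetical helper (not a tidyr function), and the list-columns are built with `$<-` so the example stays self-contained.

```r
# Hypothetical helper, not part of tidyr: unnest several list-columns in lockstep.
unnest_parallel <- function(data, cols) {
  # Number of elements in each cell, one integer vector per nested column
  n <- lapply(data[cols], lengths)
  # Every nested column must have the same per-row lengths as the first one
  for (col in cols[-1]) {
    if (!identical(n[[col]], n[[cols[1]]])) {
      stop("nested columns must have the same number of elements in each cell")
    }
  }
  # Repeat each row once per element, then overwrite the nested columns
  out <- data[rep(seq_len(nrow(data)), n[[1]]), , drop = FALSE]
  for (col in cols) {
    out[[col]] <- unlist(data[[col]])
  }
  rownames(out) <- NULL
  out
}

# Build the example input with real list-columns
x <- data.frame(c = c(11, 22))
x$a <- strsplit(c("a:b", "c"), ":")
x$b <- strsplit(c("1:2", "3"), ":")
x <- x[c("a", "b", "c")]

res <- unnest_parallel(x, c("a", "b"))
print(res)
```

Note that `b` comes out as character here, because `strsplit` returns strings; converting it to numeric would be a separate step.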

As a real world example where this comes up, the HGNC allows extracting gene family ids and descriptions, but it organizes them like this:

       hgnc_id                hgnc_gene_name hgnc_gene_family_ids                         hgnc_gene_family_descriptions
 1: HGNC:10006    Rh-associated glycoprotein  CD\tbloodgroup\tSLC   CD molecules\tBlood group antigens\tSolute carriers
 2: HGNC:10008 Rh blood group, CcEe antigens       CD\tbloodgroup                    CD molecules\tBlood group antigens
 3: HGNC:10009     Rh blood group, D antigen       CD\tbloodgroup                    CD molecules\tBlood group antigens
 4:  HGNC:1001         B-cell CLL/lymphoma 6      ZBTB\tZNF\tBTBD -\tZinc fingers, C2H2-type\tBTB/POZ domain containing

I'd like it to unnest hgnc_gene_family_ids and hgnc_gene_family_descriptions simultaneously:

       hgnc_id                hgnc_gene_name hgnc_gene_family_ids hgnc_gene_family_descriptions
 1  HGNC:10006    Rh-associated glycoprotein                   CD                  CD molecules
 2  HGNC:10006    Rh-associated glycoprotein           bloodgroup          Blood group antigens
 3  HGNC:10006    Rh-associated glycoprotein                  SLC               Solute carriers
 4  HGNC:10008 Rh blood group, CcEe antigens                   CD                  CD molecules
 5  HGNC:10008 Rh blood group, CcEe antigens           bloodgroup          Blood group antigens
 6  HGNC:10009     Rh blood group, D antigen                   CD                  CD molecules
 7  HGNC:10009     Rh blood group, D antigen           bloodgroup          Blood group antigens
 8   HGNC:1001         B-cell CLL/lymphoma 6                 ZBTB                             -
 9   HGNC:1001         B-cell CLL/lymphoma 6                  ZNF       Zinc fingers, C2H2-type
 10  HGNC:1001         B-cell CLL/lymphoma 6                 BTBD     BTB/POZ domain containing
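
A rough base R sketch of that reshaping, assuming the two columns hold tab-separated strings as shown in the first table. `split_rows` is a hypothetical name used only for this illustration, and the data frame is a two-row excerpt typed in by hand.

```r
# Hypothetical helper: split delimited columns and expand the rows in lockstep.
split_rows <- function(data, cols, sep) {
  # Split every requested column on the (fixed, non-regex) separator
  pieces <- lapply(data[cols], function(col) {
    strsplit(as.character(col), sep, fixed = TRUE)
  })
  # All columns must yield the same number of pieces in each row
  n <- lengths(pieces[[1]])
  same <- vapply(pieces, function(p) identical(lengths(p), n), logical(1))
  if (!all(same)) {
    stop("columns must split into the same number of pieces in every row")
  }
  out <- data[rep(seq_len(nrow(data)), n), , drop = FALSE]
  for (col in cols) {
    out[[col]] <- unlist(pieces[[col]])
  }
  rownames(out) <- NULL
  out
}

# Two rows excerpted from the HGNC table above
hgnc <- data.frame(
  hgnc_id = c("HGNC:10006", "HGNC:10008"),
  hgnc_gene_family_ids = c("CD\tbloodgroup\tSLC", "CD\tbloodgroup"),
  hgnc_gene_family_descriptions = c(
    "CD molecules\tBlood group antigens\tSolute carriers",
    "CD molecules\tBlood group antigens"
  ),
  stringsAsFactors = FALSE
)

long <- split_rows(
  hgnc,
  c("hgnc_gene_family_ids", "hgnc_gene_family_descriptions"),
  "\t"
)
print(long)
```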

As a preliminary implementation, I have this:

unnest <- function (data, cols){
    if(length(cols) > 1) {
       nested <- data[,cols]
       unnested <- apply(data[,cols], 2, function(x) list(unlist(x)))
       n <- lapply(nested,
           function(nested_col) vapply(nested_col, length, numeric(1)))
       if(length(unique(n)) != 1) {
           stop("nested columns must have the same number of elements in each cell")
       }
       data <- data[rep(1:nrow(data), n[[1]]),]
       which_cols <- match(cols, names(data))  # positions in the same order as cols

       for(i in 1:length(cols)){
           data[, which_cols[i] ] <- unnested[[i]]
       }
       rownames(data) <- NULL
       return(data)
    } else {
       nested <- data[[cols]]
       unnested <- list(unlist(nested))
       names(unnested) <- cols
       n <- vapply(nested, length, numeric(1))
       rest <- data[rep(1:nrow(data), n), setdiff(names(data), cols),
           drop = FALSE]
       rownames(rest) <- NULL
       return(tidyr:::append_df(rest, unnested, which(names(data) == cols) - 1))
    }
}

If this looks like something that would be generally useful, I'd be happy to make a pull request that fits it into the package.

hadley commented 9 years ago

What would you expect this to do?

data_frame(
  a = c("a:b", "c"), 
  b = c("1:2:3", "3"), 
  c = c(11,22)
) %>%
  transform(
    a = strsplit(a,":"),
    b = strsplit(b,":")
  ) %>%
  unnest(a, b)

momeara commented 9 years ago

Either give an error or fill in with NA values, like this:

a  b c
a  1 11
b  2 11
NA 3 11
c  3 22
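
A minimal base R sketch of the NA-filling option, for illustration only. `unnest_padded` is a hypothetical helper, not anything in tidyr; it pads each cell to the longest cell in its row before unnesting.

```r
# Hypothetical helper: pad shorter cells with NA, then unnest in lockstep.
unnest_padded <- function(data, cols) {
  # Longest cell per row across all nested columns
  n <- do.call(pmax, unname(lapply(data[cols], lengths)))
  out <- data[rep(seq_len(nrow(data)), n), , drop = FALSE]
  for (col in cols) {
    # Indexing past the end of a vector yields NA, which does the padding
    padded <- Map(function(cell, len) cell[seq_len(len)], data[[col]], n)
    out[[col]] <- unlist(padded)
  }
  rownames(out) <- NULL
  out
}

# The mismatched example from the question above
y <- data.frame(c = c(11, 22))
y$a <- strsplit(c("a:b", "c"), ":")
y$b <- strsplit(c("1:2:3", "3"), ":")
y <- y[c("a", "b", "c")]

res_padded <- unnest_padded(y, c("a", "b"))
print(res_padded)
```

Whether padding is safer than an error is a design question: padding silently pairs values by position, so an error may be the better default when the lengths are expected to match.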