mitre / sparklyr.nested

A sparklyr extension for nested data
Apache License 2.0

Convert spark dataframe having list<Values> to rows using sparklyR #23

Closed by ppapasani1-rms 1 year ago

ppapasani1-rms commented 4 years ago

I have a Spark data frame in which several columns hold a list of values in each row. Is there a way to flatten the data with sparklyr so that each list element becomes its own row? Here is the sample data:

   id   colA     colB   colC
1:  1 list<> 4b,8b,2b list<>
2:  2 list<> 7b,2b,2b list<>

My output should look like this:

   id colA colB  colC
1:  1    1   4b FALSE
2:  1    2   8b FALSE
3:  1    3   2b FALSE
4:  2    1   7b FALSE
5:  2    2   2b FALSE
6:  2    3   2b FALSE

My data is of medium size (around 200M records), so I don't want to collect it into R memory to perform this operation.

Here is a reproducible dataset:

library(data.table)

# Two rows: id, colA and colC are list columns; colB holds one raw vector per row
data.table(
  id   = list(1, 2),
  colA = list(list(1, 2, 3), list(1, 2, 3)),
  colB = list(as.raw(c(0x4b, 0x8b, 0x2b)), as.raw(c(0x7b, 0x2b, 0x2b))),
  colC = list(list(FALSE, FALSE, FALSE), list(FALSE, FALSE, FALSE))
)