tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

FR: promote() to create new variable from a list column #341

Closed jennybc closed 5 years ago

jennybc commented 7 years ago

Putting on the radar here at @hadley's suggestion.

What about a function promote() that can create a simplified variable from info extracted from a list column?

Example:

library(tidyverse)
x <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(species = "dragon", color = "black",
         films = c("How to Train Your Dragon", "How to Train Your Dragon 2")),
    list(species = "clownfish", color = "blue",
         films = c("Finding Nemo", "Finding Dory"))
  )
)

## hypothetical call to promote()
## x %>% promote(metadata, species)
## indicative result
x %>%
  mutate(species = simplify(map(metadata, "species"))) %>% 
  select(character, species, everything())
#> # A tibble: 2 x 3
#>   character   species   metadata
#>       <chr>     <chr>     <list>
#> 1 Toothless    dragon <list [3]>
#> 2      Dory clownfish <list [3]>

What friction would promote() remove? The auto-simplification and "putting the new variable in front of the old instead of at the end where I can never see it".
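For illustration only, here is a minimal sketch of what such a helper might look like. `promote_sketch()` is a hypothetical name, it takes the column and component as strings rather than bare names, and it is not part of tidyr or purrr. It simplifies only when every extracted element is a length-one atomic value and places the new column directly in front of the list column.

```r
library(tibble)
library(purrr)

x <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(species = "dragon", color = "black",
         films = c("How to Train Your Dragon", "How to Train Your Dragon 2")),
    list(species = "clownfish", color = "blue",
         films = c("Finding Nemo", "Finding Dory"))
  )
)

# Hypothetical helper (an assumption for illustration, not a tidyr function).
# `col` and `name` are strings; `name` must not already be a column of `data`.
promote_sketch <- function(data, col, name) {
  values <- map(data[[col]], name)
  # Simplify to an atomic vector only if every element is a length-one atomic
  is_scalar <- map_lgl(values, ~ is.atomic(.x) && length(.x) == 1)
  if (all(is_scalar)) {
    values <- unlist(values)
  }
  pos <- match(col, names(data))
  data[[name]] <- values
  # Reorder so the new column sits directly in front of the list column
  data[c(names(data)[seq_len(pos - 1)], name, names(data)[pos:(ncol(data) - 1)])]
}

promote_sketch(x, "metadata", "species")  # species becomes a character column
promote_sketch(x, "metadata", "films")    # films stays a list-column
```

The sketch copies the component rather than moving it, so `metadata` is left unchanged.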

Related to https://github.com/tidyverse/purrr/issues/336. The new capability of purrr::pluck() also seems interesting in this context.

In my real life, both issues are motivated by dealing with tibblized JSON from an API, where I have one row per item and I'm dragging around a list-column of metadata.

hadley commented 7 years ago

Should it have an option to remove that component from the list? i.e.

x %>% promote(metadata, species = "species")
#> # A tibble: 2 x 3
#>   character   species   metadata
#>       <chr>     <chr>     <list>
#> 1 Toothless    dragon <list [2]>
#> 2      Dory clownfish <list [2]>

Because species has been moved out of metadata?

jennybc commented 7 years ago

Probably. I think in my Game of Thrones character/book stuff I had to do exactly that. Seems like a good idea re: DRY principle. Then I guess you need demote(), so roundtrips are possible.

jennybc commented 7 years ago

More notes re: the conversation. Would you want to be able to promote multiple variables at once? In the limit, you are just transposing + simplifying + column binding, I suppose.

hadley commented 7 years ago

And do you want to be able to specify types like in the map functions?

jennybc commented 7 years ago

In my original fantasy, no, simplify() is simplifying if possible and giving you a list-column otherwise. But I am not 100% convinced about this aspect.

lionel- commented 7 years ago

Hmm, it feels like this should have been how unnest() works, i.e. unnesting only one column at a time along with a specification of which columns to unnest. Would it make sense to call this function unnest_at() and get _if() and _all() variants? This would be slightly inconsistent with unnest() though, because the selections would apply inside the list-column instead of selecting list-columns to unnest.

If we used mutate semantics instead of select, we could be explicit by using the vector constructors from rlang:

promote(df, list_col, Species = chr(Species))

But it wouldn't work if you want to be explicit about unnesting to another list-column. Unless we only try to simplify bare symbols?

promote(df, listcol, other_listcol = Species)            # Simplifies
promote(df, listcol, other_listcol = identity(Species))  # Doesn't simplify

With _at() variant you could still provide a selection:

promote_at(df, listcol, vars(everything()), funs(chr))

jennybc commented 7 years ago

I also started to have an eerie feeling re: connections to unnest(). One possible difference: I don't see promote() ever causing 1 row to be expanded into n rows. You are not altering your definition of a row or observation, whereas with unnest() you do.

lionel- commented 7 years ago

I think promote() could (should?) work on any list of rectangular lists, including data frames, and then you could end up expanding the number of rows.

Regarding "putting the new variable in front of the old instead of at the end where I can never see it", there is a tension with the idiom that the variable last created is placed in the last position. This allows pull() to be called without argument to retrieve that variable. It's not clear what is best. Should the print method always display the last column?

hadley commented 6 years ago

I think promote()/demote() and nest()/unnest() are related but different - promote() allows you to pull components out of a nested list one at a time; unnest() requires you splat the whole thing in one go.

hadley commented 6 years ago

Some more imaginary examples:

library(tibble)

df <- tribble(
  ~x, ~y,
  1,  list(a = 1:3, b = list(X = 3, Y = 5), c = 5),
  2,  list(a = 4,   b = list(X = 1, Y = 5), c = 7)
)

# Single value is unambiguous
# df %>% promote(y, "c")
tribble(
  ~x, ~y,                                    ~c,
  1,  list(a = 1:3, b = list(X = 3, Y = 5)), 5,
  2,  list(a = 4,   b = list(X = 1, Y = 5)), 7
)

# Named vector forms columns
# df %>% promote(y, "b")
tribble(
  ~x, ~y,                   ~X, ~Y,
  1,  list(a = 1:3, c = 5), 3,  5,
  2,  list(a = 4, c = 7),   1,  5
)

# Unnamed vector forms rows
# df %>% promote(y, "a")
tribble(
  ~x, ~y,                                   ~a,
  1,  list(b = list(X = 3, Y = 5), c = 5),  1,
  1,  list(b = list(X = 3, Y = 5), c = 5),  2,
  1,  list(b = list(X = 3, Y = 5), c = 5),  3,
  2,  list(b = list(X = 1, Y = 5), c = 7),  4
)

I think these are basically a wrapper around a mutate (which uses pluck() to hoist the element up into the data frame itself, and optionally removes from the list), and unnest (which will soon come in unnest_row() and unnest_col() variants so would have matching promote_col() and promote_row()).
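As a rough sketch of the mutate() + pluck() half of that description (the single-value case only), the following hoists component "c" and then drops it from each list. The data is the same as the imaginary example, built with tibble() rather than tribble(); the name `promoted` is introduced here for illustration, and this is not the eventual tidyr implementation.

```r
library(tibble)
library(dplyr)
library(purrr)

df <- tibble(
  x = c(1, 2),
  y = list(
    list(a = 1:3, b = list(X = 3, Y = 5), c = 5),
    list(a = 4,   b = list(X = 1, Y = 5), c = 7)
  )
)

promoted <- df %>%
  mutate(
    # Hoist the "c" component of each list into a regular double column
    c = map_dbl(y, ~ pluck(.x, "c")),
    # Optionally remove the hoisted component from the list-column
    y = map(y, ~ .x[setdiff(names(.x), "c")])
  )

promoted$c              # the hoisted values
names(promoted$y[[1]])  # the components left behind in the first row
```

Skipping the second mutate() argument gives the "copy" behavior, where the list-column is left untouched.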

colearendt commented 6 years ago

My familiarity with list columns comes largely from tibblized JSON data, as well. However, I really liked the approach taken by the tidyjson package (which recently got booted from CRAN). It didn't actually have a list column, but it acted like it did by throwing an ATTR into the tibble as a list column before every operation.

In general, I think it would be more clear to use something akin to the gather and spread verbs, but relative to the list column instead of the tibble. I.e. in @hadley's first and second examples, behavior seems more like spread, whereas the third is more of a gather. I also conceived of readr-like functionality where columns can either be selected (with types) manually or automatically (i.e. don't select a key and all values/types will be inferred, print the schema as a note, and allow for manipulation).

It might make sense to pull this functionality into a separate package (I like @hadley's idea of tidytree). Some examples I am hoping illustrate my idea:

Spread-like behavior. The development version of tidyjson had a spread_all that recursed through keys (it would not spread a list-column like I do below), which was helpful. Gathering was the only option for dealing with arrays, although a spread option had been proposed.

tree <- tibble::tibble(
  key = c(1, 2),
  list_col = list(
    list(a = c(1, 2), b = c(3, 4)),
    list(a = c(5, 6), b = c(7, 8))
  )
)
print(tree)
#> # A tibble: 2 x 2
#>     key   list_col
#>   <dbl>     <list>
#> 1     1 <list [2]>
#> 2     2 <list [2]>

# tree %>% spread_tree(list_col, levels = 1)
# Parsed with column specification:
#  cols(
#    a = col_list(),
#    b = col_list()
#  )
output_level1 <- tibble::tibble(
  key = c(1, 2),
  a = list(c(1, 2), c(5, 6)),
  b = list(c(3, 4), c(7, 8))
)
print(output_level1)
#> # A tibble: 2 x 3
#>     key         a         b
#>   <dbl>    <list>    <list>
#> 1     1 <dbl [2]> <dbl [2]>
#> 2     2 <dbl [2]> <dbl [2]>

# output_level1 %>% spread_tree(levels = 1) # hits all list columns?
#  Parsed with column specification:
#    cols(
#      key = col_integer(),
#      a_1 = col_integer(),
#      a_2 = col_integer(),
#      b_1 = col_integer(),
#      b_2 = col_integer()
#    )
output_level2 <- tibble::tibble(
  key = c(1, 2),
  a_1 = c(1, 5),
  a_2 = c(2, 6),
  b_1 = c(3, 7),
  b_2 = c(4, 8)
)
print(output_level2)
#> # A tibble: 2 x 5
#>     key   a_1   a_2   b_1   b_2
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     2     3     4
#> 2     2     5     6     7     8

Gather-like behavior:

tree <- tibble::tibble(
  key = c(1, 2),
  list_col = list(
    list("a", "b", "c"),
    list("d", "e", "f")
  )
)
#> # A tibble: 2 x 2
#>     key   list_col
#>   <dbl>     <list>
#> 1     1 <list [3]>
#> 2     2 <list [3]>

# tree %>% gather_tree(list_col)
# Parsed with column specification:
#  cols(
#    output = col_character()
#  )
tibble::tibble(
  key = c(rep(1, 3), rep(2, 3)),
  output = c("a", "b", "c", "d", "e", "f")
)
#> # A tibble: 6 x 2
#>     key output
#>   <dbl>  <chr>
#> 1     1      a
#> 2     1      b
#> 3     1      c
#> 4     2      d
#> 5     2      e
#> 6     2      f

I am glossing over several tricky things here - what inferences to make about names when not provided, how to enable the user to control those inferences, etc.

One last tidbit that I thought tidyjson did well - it tried to "guarantee" your output state, if you defined what you wanted. I.e. if you spread_tree(list_col,output=col_character()), it would try to coerce a col_character even if the underlying list structure changed (i.e. the JSON data changed over time).

I would love to see this list-column implementation expanded to deal with XML, JSON, etc. in a generalized way. Curious to hear your thoughts!

colearendt commented 6 years ago

Relates to #418 , I believe

hadley commented 5 years ago

Another potential application: https://sharla.party/posts/discog-purrr/

hadley commented 5 years ago

Latest thoughts:

All together, I think this means we can be more precise about the use of hoist(): it's designed to reach into a list-col and pull out selected components. It's a wrapper around the common mutate() + map_*() + pluck() pattern, with the advantages that it can move the data (rather than copying it) and that it can use vctrs-style type resolution.

The main question is the interface. Should it take a column name, and the name of the components inside that column to hoist?

df %>% hoist(metadata, "species")
df %>% hoist(metadata, "films")
df %>% hoist(metadata, c("films", "species", "color"))

Or should it take a set of named pluck expressions?

df %>% hoist(species = c("metadata", "species"), first_film = list("metadata", "films", 1L))

The first form is less flexible, but forces a step-by-step approach to dealing with deeply nested columns that I think might be helpful (i.e. you don't need to discover the pluck expression up front). It's also nice that existing columns can be referred to without quotes, whereas the new columns require quotes.

I have convinced myself that the simpler form is better, so please speak up now if you see an obvious downside!

lionel- commented 5 years ago

I think deep plucking might be useful when dealing with web metadata. Also the plucked objects are not existing columns but they are still existing objects, so the character vector syntax is less obvious. Maybe with the pluck syntax it is more obvious.

For these two reasons, I think I prefer the second form. Also it seems more natural to me to define new columns with parameter syntax, as in mutate().

hadley commented 5 years ago

@lionel- You can still use mutate() + map() for that. i.e.

df %>% hoist(
  species = c("metadata", "species"), 
  first_film = list("metadata", "films", 1L)
)

Is equivalent to (and not much shorter than)

df %>% mutate(
  species = map_c(metadata, "species"),
  first_film = map_c(metadata, list("films", 1))
)

(assuming a map_c() function with vctrs semantics)
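One way such a map_c() might be sketched, purely as an assumption about what "vctrs semantics" could mean here: map as usual, then combine the results with vctrs::vec_c() so that the output type comes from vctrs coercion rules. No map_c() exists in purrr; the definition below is hypothetical, and unlike a strict version it does not check that each result has length one.

```r
library(purrr)
library(vctrs)

# Hypothetical helper: map, then combine with vctrs type resolution.
# Incompatible element types raise a vctrs error instead of being coerced silently.
map_c <- function(.x, .f, ...) {
  vec_c(!!!map(.x, .f, ...))
}

metadata <- list(
  list(species = "dragon",
       films = c("How to Train Your Dragon", "How to Train Your Dragon 2")),
  list(species = "clownfish",
       films = c("Finding Nemo", "Finding Dory"))
)

map_c(metadata, "species")         # character vector of species
map_c(metadata, list("films", 1))  # first film for each character
```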

jennybc commented 5 years ago

df %>% hoist(metadata, c("films", "species", "color"))

I really wince at this, because it re-aggravates people's existing confusion about

  1. what's going to happen when you map a character vector over a list
  2. the consequence of providing "loose parts" vs. bundling stuff via c() or list()

I'll be back with some examples, if the conversation doesn't move past me too quickly.

jennybc commented 5 years ago

Here's my example for point 1. re: potential to aggravate existing confusion for people mastering "how to work with lists and list-cols".

library(purrr)
library(repurrrsive)

Let’s say you’re interested in multiple fields for each GoT character. How can something that feels so right be this wrong?

map(got_chars[1:2], c("name", "culture", "born"))
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL

No, silly, you’ve got to map `[` in this case.

map(got_chars[1:2], `[`, c("name", "culture", "born"))
#> [[1]]
#> [[1]]$name
#> [1] "Theon Greyjoy"
#> 
#> [[1]]$culture
#> [1] "Ironborn"
#> 
#> [[1]]$born
#> [1] "In 278 AC or 279 AC, at Pyke"
#> 
#> 
#> [[2]]
#> [[2]]$name
#> [1] "Tyrion Lannister"
#> 
#> [[2]]$culture
#> [1] ""
#> 
#> [[2]]$born
#> [1] "In 273 AC, at Casterly Rock"

But the proposed hoist() syntax creates a legitimate use for that first syntax and, I think, could make it harder for people to get all this straight in their mind.

jennybc commented 5 years ago

Here's my example for point 2 re: potential to aggravate existing confusion for people mastering "how to work with lists and list-cols".

library(purrr)
library(repurrrsive)

# setup, nothing to see here
names(gh_repos) <- map_chr(gh_repos, list(1, "owner", "login"))

Providing indexing info as “loose parts” does not error, but this is not correct.

map(gh_repos, 4, "owner", "login")
#> $gaborcsardi
#> $gaborcsardi$id
#> [1] 34924886
#> 
#> $gaborcsardi$name
#> [1] "baseimports"
#> 
#>  overwhelming amount of output follows ...

What if we pack indexing info via c()? No error but still wrong.

map(gh_repos, c(4, "owner", "login"))
#> $gaborcsardi
#> NULL
#> 
#> $jennybc
#> NULL
#> 
#> $jtleek
#> NULL
#> 
#> $juliasilge
#> NULL
#> 
#> $leeper
#> NULL
#> 
#> $masalmon
#> NULL

What if we pack indexing info via list()? Bingo!

map(gh_repos, list(4, "owner", "login"))
#> $gaborcsardi
#> [1] "gaborcsardi"
#> 
#> $jennybc
#> [1] "jennybc"
#> 
#> $jtleek
#> [1] "jtleek"
#> 
#> $juliasilge
#> [1] "juliasilge"
#> 
#> $leeper
#> [1] "leeper"
#> 
#> $masalmon
#> [1] "masalmon"

Created on 2019-04-24 by the reprex package (v0.2.1.9000)

jennybc commented 5 years ago

I realize this conversation is about hoist(), which operates in the context of a tibble that hosts a list-column, not about purrr. But it's very tied up in people being competent with lists in general and with plucking, specifically.

I think it's important to view the hoist() syntax in that context.

hadley commented 5 years ago

How about this compromise between the two forms? We adhere closer to pluck syntax, but allow you to apply it to only a single list-col at a time (hence considerably reducing duplication):

df %>% hoist(metadata,
  species = "species",
  first_film = list("films", 1L)
)

(I've also decided it's easiest to leave the list column as is; attempting to combine removal with pluck semantics is too complicated)
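For a runnable version of the compromise syntax, here is a sketch using the data from the top of the thread. It assumes a tidyr release that ships hoist() (1.0.0 or later). In those releases hoisted components are removed from the list-column by default, so `.remove = FALSE` is passed here to match the "leave the list column as is" decision above. The name `hoisted` is introduced only for this example.

```r
library(tibble)
library(tidyr)

x <- tibble(
  character = c("Toothless", "Dory"),
  metadata = list(
    list(species = "dragon", color = "black",
         films = c("How to Train Your Dragon", "How to Train Your Dragon 2")),
    list(species = "clownfish", color = "blue",
         films = c("Finding Nemo", "Finding Dory"))
  )
)

hoisted <- x %>% hoist(metadata,
  species = "species",             # a single name plucks one component
  first_film = list("films", 1L),  # a list() bundles a deeper pluck path
  .remove = FALSE                  # keep metadata unchanged
)

hoisted$species
hoisted$first_film
```

Note that the deeper path is bundled with list(), not c(), which is the same distinction as in the purrr examples earlier in the thread.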