tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

Preserve missing rows when unnesting #358

Closed leungi closed 5 years ago

leungi commented 7 years ago
Hi,

Suppose a tibble like this (columns separated by ' | '):

index | text | polarity | polarity_confidence | aspects
1 | blah1 | positive | 0.579939 | list()
2 | blah2 | negative | 0.693546 | list()
3 | blah3 | negative | 0.676733 | list()
4 | blah4 | positive | 0.756442 | list()
5 | blah5 | positive | 0.815249 | list()
6 | blah6 | positive | 0.72212 | list()
7 | blah7 | negative | 0.808398 | list(a = value, b = value, c = value)
8 | blah8 | negative | 0.63281 | list()
9 | blah9 | negative | 0.709047 | list()
10 | blah10 | negative | 0.912631 | list()
11 | blah11 | negative | 0.752882 | list(a = value, b = value, c = value)

Issue:

tibble %>% unnest(aspects)
## drops every row except 7 and 11 (i.e. those with a non-empty list); '.drop = FALSE' doesn't help

My current workaround is as follows:

1) by row, determine if the list is empty (using length())
2) if the list is empty, substitute a dummy non-empty list (using if_else)
3) then unnest

Workaround code:

tibble %>%
  mutate(listLength = map_int(aspects, length)) %>%
  mutate(aspects = if_else(listLength <= 0, list(data.frame("NA")), aspects)) %>%
  unnest(aspects)

Desired output:

index | text | polarity | polarity_confidence | a | b | c
1 | blah1 | positive | 0.579939 | NA | NA | NA
2 | blah2 | negative | 0.693546 | NA | NA | NA
3 | blah3 | negative | 0.676733 | NA | NA | NA
4 | blah4 | positive | 0.756442 | NA | NA | NA
5 | blah5 | positive | 0.815249 | NA | NA | NA
6 | blah6 | positive | 0.72212 | NA | NA | NA
7 | blah7 | negative | 0.808398 | value | value | value
8 | blah8 | negative | 0.63281 | NA | NA | NA
9 | blah9 | negative | 0.709047 | NA | NA | NA
10 | blah10 | negative | 0.912631 | NA | NA | NA
11 | blah11 | negative | 0.752882 | value | value | value

Am I missing something? Looking forward to insights. Thanks in advance.

hadley commented 6 years ago

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

markdly commented 6 years ago

Adding a minimal reprex based on my understanding of the OP's issue. I think this is also related to #316.

I conceptually think of unnest as something that results in more rows/columns than the tibble provided, while nest results in fewer rows/columns. Perhaps this is why these issues have been raised: losing rows during an unnest might be counterintuitive for some users (myself included), even though unnest is working as documented.

I think the desired result for both this issue and #316 is a dplyr::left_join of the non-list columns with the unnest results, as shown in the workaround below.

library(dplyr)
library(tidyr)
library(purrr)
df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# Row with empty tibble has been removed
df %>% unnest()
#> # A tibble: 1 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     2     5     7

# Would like to keep all rows instead. Possible workaround: 
df1 <- df %>% select(-y)
df2 <- df %>% filter(lengths(y) > 0) %>% unnest()  # keep rows whose list element is non-empty
left_join(df1, df2, by = "x")
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7

Perhaps an extra example in the documentation to highlight this feature of unnest could help to make users more aware of this situation? (I'd be happy to draft a PR if that was the case)...

leungi commented 6 years ago

Hadley/Mark, thanks for reviewing this; apologies for delayed reply as I got tied up with work.

The original data in question came from an API call, and I didn't save it, but it's similar to what Mark has. He's also on point regarding my issue.

Mark's solution yields the same intended result as my workaround.

Based on Mark's comments, this behavior is by design, though I believe it would be useful to have an argument in unnest to keep rows with empty lists after unnesting. I run into this situation quite often in my work.

hadley commented 6 years ago

Hmmmmm, maybe it's worth having an option for this, but I'm not sure what to call it.

leungi commented 6 years ago

Thanks Hadley.

Suggestion: na.drop = T/F

markdly commented 6 years ago

How about empty = "drop" or "fill"

(e.g. similar approach to the extra and fill option values in separate)

hadley commented 6 years ago

replace_na() now works with list-cols so you can at least do this:

library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

df %>% 
  replace_na(list(y = list(tibble(a = NA, b = NA)))) %>%
  unnest()
#> # A tibble: 2 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     1    NA    NA
#> 2     2     5     7

leungi commented 6 years ago

Hadley,

I'm using tibble_1.3.4 and tidyr_0.7.2, but can't reproduce your output; perhaps the upgraded replace_na is not in the latest CRAN versions yet.

library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

df %>% 
  replace_na(list(y = list(tibble(a = NA, b = NA)))) %>%
  unnest()
#> # A tibble: 1 x 3
#>       x     a     b
#>   <int> <dbl> <dbl>
#> 1     2     5     7

hadley commented 6 years ago

It's in the dev version, sorry.

hadley commented 6 years ago

I've now hit this use case in two practical problems, so I definitely believe it should be an option.

leungi commented 6 years ago

Happy 2018; thanks for update.

Look forward to your enhancements!

jrgilbertson commented 6 years ago

Thank you for the temporary workaround (and upcoming feature)! Spent more time than I'd like to admit tonight trying to figure out this exact use case...

hadley commented 6 years ago

Note: this is related to a left join vs an inner join.
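
To make the join analogy concrete, here is a minimal sketch using dplyr. The `parent` and `child` tibbles are hypothetical stand-ins for the outer columns and the flattened list-column contents; they are not part of tidyr's implementation.

```r
library(dplyr)

# Parent rows: one per row of the original data frame
parent <- tibble(x = 1:2)

# Child rows: what the list-column holds once flattened; the element for
# x = 1 was empty, so it contributes no rows here
child <- tibble(x = 2L, a = 5, b = 7)

# Inner join: parent rows without a match disappear,
# which mirrors unnest() dropping rows with empty elements
inner <- inner_join(parent, child, by = "x")

# Left join: every parent row survives and unmatched columns become NA,
# which is the behavior requested in this issue
left <- left_join(parent, child, by = "x")

nrow(inner)  # 1
nrow(left)   # 2
```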

leungi commented 6 years ago

Thanks for the update and for linking the issue, @hadley; I will try it out when nest_join() is available in the dev version.

hadley commented 6 years ago

I think this might be best as drop = FALSE and can be implemented internally with something like:


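explicit_na <- function(x) {
  # Number of dimensions: 0 for plain vectors and lists, 2 for data frames
  dims <- length(dim(x))
  if (dims == 0L && length(x) == 0) {
    # Empty vector: indexing with NA yields a single missing element
    x[NA_integer_]
  } else if (dims == 2L && nrow(x) == 0) {
    # Zero-row data frame: indexing with NA yields a single row of NAs
    x[NA_integer_, , drop = FALSE]
  } else {
    x
  }
}

markdly commented 6 years ago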

I couldn't get explicit_na to work as it is, but if I tweak it slightly:

library(dplyr)
explicit_na <- function(x) {
  dims <- length(dim(x))
  if (dims == 0L && length(x) == 0) {  
    x <- ifelse(is.list(x) && !is.data.frame(x), list(NA_integer_), NA_integer_)
  } else if (dims == 2L && nrow(x) == 0) {  
    x[TRUE, ] <- NA_integer_
  }
  x
}

These cases return what I'd expect

character(0) %>% explicit_na()
#> [1] NA

list() %>% explicit_na()
#> [[1]]
#> [1] NA

data.frame(a = character()) %>% explicit_na()
#>      a
#> 1 <NA>

But now I'm wondering what should happen if a dataframe has no names?

df <- data.frame()
df
#> data frame with 0 columns and 0 rows

df %>% explicit_na()
#> data frame with 0 columns and 1 row

hadley commented 6 years ago

That function is just a reminder for me. It needs testing.
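
As a starting point for that testing, here is a base-R sketch with a few checks, including the zero-column data frame case raised above. It restates the explicit_na() idea under the assumption that `dim()` and `NA_integer_` are the intended spellings. It is not the code that ended up in tidyr.

```r
# Sketch only: restates the explicit_na() idea from this thread,
# assuming dim() and NA_integer_ are the intended spellings.
explicit_na <- function(x) {
  n_dims <- length(dim(x))
  if (n_dims == 0L && length(x) == 0L) {
    x[NA_integer_]
  } else if (n_dims == 2L && nrow(x) == 0L) {
    x[NA_integer_, , drop = FALSE]
  } else {
    x
  }
}

# Empty atomic vector becomes one missing value of the same type
stopifnot(identical(explicit_na(character(0)), NA_character_))

# Empty list becomes a list of length one
stopifnot(length(explicit_na(list())) == 1L)

# Zero-row data frame becomes a single row of NA
one_row <- explicit_na(data.frame(a = character()))
stopifnot(nrow(one_row) == 1L, is.na(one_row$a))

# Zero-column, zero-row data frame gains a row but still has no columns
no_cols <- explicit_na(data.frame())
stopifnot(nrow(no_cols) == 1L, ncol(no_cols) == 0L)

# Non-empty input is returned unchanged
stopifnot(identical(explicit_na(1:3), 1:3))
```

Whether a zero-column data frame should really gain a row is a design question the checks only document; they do not settle it.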

hadley commented 5 years ago

Note to self: can't use .drop because it's already used to control if the variables being unnested are dropped.

hadley commented 5 years ago

Currently implemented in unnest2(), which I'm going to re-unify with unnest() shortly.
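
For readers arriving later, here is a minimal sketch of how the option looks from the user's side, assuming tidyr 1.0.0 or later, where unnest() documents a `keep_empty` argument. The data frame is the one from the reprex earlier in this thread.

```r
library(tidyr)
library(tibble)

df <- tibble(x = 1:2, y = list(tibble(), tibble(a = 5, b = 7)))

# Default: the row whose list element is empty is dropped
dropped <- unnest(df, y)

# keep_empty = TRUE: the empty element is replaced by one row of missing values,
# so the row with x = 1 is preserved (assumes tidyr >= 1.0.0)
kept <- unnest(df, y, keep_empty = TRUE)

nrow(dropped)
nrow(kept)
```

On older tidyr versions the argument does not exist, so the replace_na() or left_join() workarounds above remain the fallback.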