tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

unnest with ptype raises conversion error despite identical types #1052

Closed mmuurr closed 3 years ago

mmuurr commented 4 years ago

I don't know whether this is an issue with tidyr, with vctrs, or simply with how I'm using the ptype argument in rectangling operations. The error message below suggests that a cast from double to double is failing due to loss of precision, which confuses me. Alternatively, the lossiness error may stem from the containing data frame/tibble, but in the example below I've tried to construct the ptype argument to match the x$bar object as closely as possible.

library(tidyr)
library(vctrs)

x <- tibble::tibble(
  foo = "foo",
  bar = list(
    tibble::tibble(bar1 = as.double(1:3), bar2 = as.double(3:1))
  )
)

## works as expected:
unnest(x)
#> Warning: `cols` is now required when using unnest().
#> Please use `cols = c(bar)`
#> # A tibble: 3 x 3
#>   foo    bar1  bar2
#>   <chr> <dbl> <dbl>
#> 1 foo       1     3
#> 2 foo       2     2
#> 3 foo       3     1

## should produce same result as above (I think), but errs instead:
unnest(x, bar, ptype = tibble::tibble(bar1 = double(0), bar2 = double(0)))
#> Error: Can't convert from <data.frame<bar:tbl_df<
#>   bar1: double
#>   bar2: double
#> >>> to <tbl_df<
#>   bar1: double
#>   bar2: double
#> >> due to loss of precision.
#> Dropped variables: `bar`
Session info

```R
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-apple-darwin19.4.0 (64-bit)
#> Running under: macOS Catalina 10.15.3
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.10_1/lib/libopenblasp-r0.3.10.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices datasets utils methods base
#>
#> other attached packages:
#> [1] vctrs_0.3.1 tidyr_1.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.5 knitr_1.29 magrittr_1.5 tidyselect_1.1.0
#> [5] R6_2.4.1 rlang_0.4.7 fansi_0.4.1 stringr_1.4.0
#> [9] highr_0.8 dplyr_1.0.0 tools_3.6.3 xfun_0.15
#> [13] utf8_1.1.4 cli_2.0.2 htmltools_0.5.0 ellipsis_0.3.1
#> [17] assertthat_0.2.1 yaml_2.2.1 digest_0.6.25 tibble_3.0.3
#> [21] lifecycle_0.2.0 crayon_1.3.4 purrr_0.3.4 glue_1.4.1
#> [25] evaluate_0.14 rmarkdown_2.3 stringi_1.4.6 compiler_3.6.3
#> [29] pillar_1.4.6 generics_0.0.2 renv_0.11.0 pkgconfig_2.0.3
```

hadley commented 4 years ago

I don't have any feedback on the underlying problem, but I noticed that the types are:

<data.frame
  <bar: tbl_df<
    bar1: double
    bar2: double
  >>
>

vs

<tbl_df<
   bar1: double
   bar2: double
>>
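
The mismatch can be reproduced with vctrs alone, which may help to show that it is structural rather than numeric. The sketch below assumes that the error comes from casting a data frame to a prototype that lacks one of its columns, as the "Dropped variables" line in the message suggests. The names `nested` and `flat` are only illustrative.

```r
library(vctrs)
library(tibble)

# Shape reported on the left of the error message: a data frame with one
# data-frame column `bar` that holds bar1 and bar2.
nested <- tibble(bar = tibble(bar1 = double(), bar2 = double()))

# Shape of the ptype passed in the report: bar1 and bar2 at the top level.
flat <- tibble(bar1 = double(), bar2 = double())

# Print both prototypes for comparison.
vec_ptype_show(nested)
vec_ptype_show(flat)

# Casting nested -> flat would have to drop `bar`, so vctrs signals an error.
res <- tryCatch(vec_cast(nested, flat), error = function(e) e)
inherits(res, "error")

# Casting to a prototype of the same shape succeeds.
same <- vec_cast(nested, nested)
```

The doubles themselves never need converting here; the cast fails only because the two prototypes disagree about where bar1 and bar2 live.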

mgirlich commented 3 years ago

The correct way to specify this is to wrap your ptype in an extra tibble(bar = ...), so that it names the column being unnested. See below:

library(tidyr)
x <- tibble::tibble(
  foo = "foo",
  bar = list(
    tibble::tibble(bar1 = as.double(1:3), bar2 = as.double(3:1))
  )
)

unnest(x, bar)
#> # A tibble: 3 x 3
#>   foo    bar1  bar2
#>   <chr> <dbl> <dbl>
#> 1 foo       1     3
#> 2 foo       2     2
#> 3 foo       3     1
unnest(x, bar, ptype = tibble(bar = tibble(bar1 = double(0), bar2 = double(0))))
#> # A tibble: 3 x 3
#>   foo    bar1  bar2
#>   <chr> <dbl> <dbl>
#> 1 foo       1     3
#> 2 foo       2     2
#> 3 foo       3     1

Created on 2020-12-08 by the reprex package (v0.3.0)
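
One way to see why the wrapper is needed is to split the operation in two. The sketch below assumes, as the tidyr 1.x source suggests, that unnest() behaves like unchop() followed by unpack(). Between the two steps, bar is a single data-frame column, which is the shape that the ptype in the example above mirrors. The names `chopped` and `unpacked` are only illustrative.

```r
library(tidyr)

x <- tibble(
  foo = "foo",
  bar = list(
    tibble(bar1 = as.double(1:3), bar2 = as.double(3:1))
  )
)

# Step 1: combine the list elements row-wise. `bar` is now one
# data-frame column with bar1 and bar2 inside it.
chopped <- unchop(x, bar)
is.data.frame(chopped$bar)
names(chopped)

# Step 2: lift bar1 and bar2 out of `bar` into top-level columns.
unpacked <- unpack(chopped, bar)
names(unpacked)
```

If that assumption holds, the ptype is checked against the intermediate result of step 1, where the inner columns still sit under bar.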

The extra tibble() is quite confusing. On the other hand, one could argue it is useful when unnesting multiple columns that need names_repair:

library(tidyr)
df <- tibble(
  y = list(
    tibble(a = 1),
    tibble(a = 2)
  ),
  z = list(
    tibble(a = "a"),
    tibble(a = "b")
  )
)
df %>% unnest(c(y, z), names_repair = "unique")
#> New names:
#> * a -> a...1
#> * a -> a...2
#> # A tibble: 2 x 2
#>   a...1 a...2
#>   <dbl> <chr>
#> 1     1 a    
#> 2     2 b

df %>% unnest(
  c(y, z),
  names_repair = "unique",
  ptype = tibble(
    y = tibble(a = integer()),
    z = tibble(a = character())
  )
)
#> New names:
#> * a -> a...1
#> * a -> a...2
#> # A tibble: 2 x 2
#>   a...1 a...2
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b

Created on 2020-12-08 by the reprex package (v0.3.0)

Or, more generally, one could argue it is useful that ptype lets you specify which list-column each output column comes from.

Note that the ptype has to list every column that occurs when unnesting. I am not sure whether this is intended.

library(tidyr)
df <- tibble(
  y = list(
    tibble(a = 1, b = 1),
    tibble(a = 2)
  )
)

df %>% unnest(
  y,
  names_repair = "unique",
  ptype = tibble(
    y = tibble(a = integer())
  )
)
#> Error: Can't convert from <tbl_df<
#>   a: double
#>   b: double
#> >> to <tbl_df<a:integer>> due to loss of precision.
#> Dropped variables: `b`

Created on 2020-12-08 by the reprex package (v0.3.0)
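
If listing every column by hand is tedious, one option is to compute a starting prototype from the data. This is a sketch rather than a recommendation from the thread: it uses vctrs::vec_ptype_common() on the list elements, and the name `common` is only illustrative.

```r
library(tidyr)
library(vctrs)

df <- tibble(
  y = list(
    tibble(a = 1, b = 1),
    tibble(a = 2)
  )
)

# Common prototype of all elements of `y`: a zero-row tibble that
# contains every column seen in any element.
common <- vec_ptype_common(!!!df$y)
names(common)
nrow(common)

# Individual columns can then be adjusted, e.g. to request integers for `a`.
common$a <- integer()
```

The adjusted prototype covers both a and b, so wrapping it under the column name y in the form shown earlier should avoid the "Dropped variables" error, although the exact wrapping may depend on the tidyr version.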