markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

tidytable::separate() function error #666

Closed tungttnguyen closed 2 years ago

tungttnguyen commented 2 years ago

Hi Mark,

In the example below, tidytable::separate() threw an error while tidyr::separate() did not. Can you check what went wrong? Thank you!

library(tidyr)
packageVersion("tidyr")
#> [1] '1.2.1'

library(tidytable)
packageVersion("tidytable")
#> [1] '0.9.0'

df1 <- data.frame(
  stringsAsFactors = FALSE,
  Path = c("/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/"),
  Index = c("1999-10-01 16:00:00","1999-11-01 16:00:00",
            "1999-12-01 16:00:00"),
  Value = c(3.94, 2.14, 1.39)
)
df1
#>                        Path               Index Value
#> 1 /MODEL/WAT/VAL//1MON/VOL/ 1999-10-01 16:00:00  3.94
#> 2 /MODEL/WAT/VAL//1MON/VOL/ 1999-11-01 16:00:00  2.14
#> 3 /MODEL/WAT/VAL//1MON/VOL/ 1999-12-01 16:00:00  1.39

### no error
df1_tidyr <- df1 %>% 
  tidyr::separate(Path, into = c("dummy1", 
                                 "A", "B", "C", "D", "E", "F",
                                 "dummy2"),
                  sep = "/") 
df1_tidyr
#>   dummy1     A   B   C D    E   F dummy2               Index Value
#> 1        MODEL WAT VAL   1MON VOL        1999-10-01 16:00:00  3.94
#> 2        MODEL WAT VAL   1MON VOL        1999-11-01 16:00:00  2.14
#> 3        MODEL WAT VAL   1MON VOL        1999-12-01 16:00:00  1.39

### error
df1_dt <- df1 %>% 
  separate(Path, into = c("dummy1", 
                          "A", "B", "C", "D", "E", "F",
                          "dummy2"),
           sep = "/") 
#> Error in data.table::tstrsplit(Path, split = "/", fixed = TRUE, keep = 1:8, : 'keep' should contain integer values between 1 and 7.
df1_dt
#> Error in eval(expr, envir, enclos): object 'df1_dt' not found

Created on 2022-10-21 with reprex v2.0.2

markfairbanks commented 2 years ago

Smaller reprex:

pacman::p_load(tidytable)

df <- tidytable(x = "/a/b/")

df %>%
  separate(x, into = c("dummy1", "a", "b", "dummy2"), sep = "/")
#> Error in data.table::tstrsplit(x, split = "/", fixed = TRUE, keep = 1:4, : 'keep' should contain integer values between 1 and 3.

markfairbanks commented 2 years ago

This occurs because of a difference between base::strsplit() (which data.table uses) and stringr::str_split() (which tidyr uses). When the string ends with the separator, base::strsplit() drops the trailing empty piece, while stringr::str_split() keeps it:

chr <- "/a/b/"

strsplit(chr, split = "/")
#> [[1]]
#> [1] ""  "a" "b"

stringr::str_split(chr, pattern = "/")
#> [[1]]
#> [1] ""  "a" "b" ""
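
To make the mismatch concrete, here is a small base-R sketch (the variable names are illustrative only, not anything from tidytable or data.table). It compares the number of pieces base::strsplit() returns with the number implied by counting separators, which appears to be why `keep = 1:4` is rejected:

```r
chr <- "/a/b/"
into <- c("dummy1", "a", "b", "dummy2")

# Pieces that base::strsplit() actually returns (trailing empty piece dropped)
n_pieces <- lengths(strsplit(chr, split = "/", fixed = TRUE))

# Pieces implied by counting separators: one more than the number of "/"
n_expected <- lengths(regmatches(chr, gregexpr("/", chr, fixed = TRUE))) + 1L

n_pieces    # 3
n_expected  # 4

# TRUE: `into` names more columns than there are pieces to keep
length(into) > n_pieces
```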

For now the workaround is to drop "dummy2" from into = since the last "empty" column isn't created when using data.table:

pacman::p_load(tidytable)

df1 <- data.frame(
  stringsAsFactors = FALSE,
  Path = c("/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/"),
  Index = c("1999-10-01 16:00:00","1999-11-01 16:00:00",
            "1999-12-01 16:00:00"),
  Value = c(3.94, 2.14, 1.39)
)

df1 %>%
  separate(Path, into = c("dummy1",
                          "A", "B", "C", "D", "E", "F"),
           sep = "/")
#> # A tidytable: 3 × 9
#>   Index               Value dummy1 A     B     C     D     E     F    
#>   <chr>               <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1999-10-01 16:00:00  3.94 ""     MODEL WAT   VAL   ""    1MON  VOL  
#> 2 1999-11-01 16:00:00  2.14 ""     MODEL WAT   VAL   ""    1MON  VOL  
#> 3 1999-12-01 16:00:00  1.39 ""     MODEL WAT   VAL   ""    1MON  VOL

tungttnguyen commented 2 years ago

Thank you!

markfairbanks commented 2 years ago

@tungttnguyen - I've decided I'm going to just leave this one as-is. It seems like an edge case, and fixing it will have too big of a cost on performance.

Thanks for reporting either way 😄

markfairbanks commented 2 years ago

Thinking about this some more: it won't match tidyr's behavior exactly in your case, but I can build it so that it doesn't error when too many (or too few) column names are provided to into =. This already works in tidyr::separate():

pacman::p_load(tidyr)

df <- tibble(x = c("a_a", "b_b", "c_c"))

# Too many
df %>%
  separate(x, c("one", "two", "three"), sep = "_")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].
#> # A tibble: 3 × 3
#>   one   two   three
#>   <chr> <chr> <chr>
#> 1 a     a     <NA> 
#> 2 b     b     <NA> 
#> 3 c     c     <NA>

# Too few
df %>%
  separate(x, "one", sep = "_")
#> Warning: Expected 1 pieces. Additional pieces discarded in 3 rows [1, 2, 3].
#> # A tibble: 3 × 1
#>   one  
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 c

markfairbanks commented 2 years ago

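All set. separate() no longer errors when into = has more names than there are pieces; the extra column is filled with NA: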

pacman::p_load(tidytable)

df1 <- data.frame(
  stringsAsFactors = FALSE,
  Path = c("/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/",
           "/MODEL/WAT/VAL//1MON/VOL/"),
  Index = c("1999-10-01 16:00:00","1999-11-01 16:00:00",
            "1999-12-01 16:00:00"),
  Value = c(3.94, 2.14, 1.39)
)

df1 %>%
  separate(Path, into = c("dummy1",
                          "A", "B", "C", "D", "E", "F",
                          "dummy2"),
           sep = "/")
#> # A tidytable: 3 × 10
#>   Index               Value dummy1 A     B     C     D     E     F     dummy2
#>   <chr>               <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> 
#> 1 1999-10-01 16:00:00  3.94 ""     MODEL WAT   VAL   ""    1MON  VOL   <NA>  
#> 2 1999-11-01 16:00:00  2.14 ""     MODEL WAT   VAL   ""    1MON  VOL   <NA>  
#> 3 1999-12-01 16:00:00  1.39 ""     MODEL WAT   VAL   ""    1MON  VOL   <NA>
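
For readers curious how this kind of pad-or-truncate behavior can be expressed, below is a minimal base-R sketch. It is not tidytable's actual implementation; `split_to_width()` is a hypothetical helper written only for illustration, and it assumes a fixed (non-regex) separator.

```r
# Hypothetical helper (not part of tidytable or data.table): split each string
# on a fixed separator, then pad with NA or truncate to length(into) pieces.
split_to_width <- function(x, into, sep) {
  pieces <- strsplit(x, split = sep, fixed = TRUE)
  n <- length(into)
  fixed_width <- lapply(pieces, function(p) {
    length(p) <- n  # pads with NA when too short, truncates when too long
    p
  })
  out <- as.data.frame(do.call(rbind, fixed_width), stringsAsFactors = FALSE)
  names(out) <- into
  out
}

# Too many names: dummy2 is NA because strsplit() drops the trailing empty piece
res_many <- split_to_width("/a/b/", into = c("dummy1", "a", "b", "dummy2"), sep = "/")

# Too few names: the extra pieces are discarded
res_few <- split_to_width(c("a_a", "b_b", "c_c"), into = "one", sep = "_")
```

Unlike tidyr::separate(), this sketch pads and truncates silently; a fuller version would probably warn in both cases.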