nathaneastwood / poorman

A poor man's dependency free grammar of data manipulation
https://nathaneastwood.github.io/poorman/

feat: Implement separate() #107

Open etiennebacher opened 2 years ago

etiennebacher commented 2 years ago

This PR implements separate() to split a column into several ones, either based on a regex or on location.

@nathaneastwood this PR is not complete. I'm opening it as a draft so that it is saved somewhere and so that you can help with the TODO list if you have some time.

TODO:

Some examples:

suppressPackageStartupMessages(library(poorman))

df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
df
#>      x
#> 1 <NA>
#> 2  x.y
#> 3  x.z
#> 4  y.z
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    x    y
#> 3    x    z
#> 4    y    z

df <- data.frame(x = c(NA, "a1b", "c4d", "e9g"))
df
#>      x
#> 1 <NA>
#> 2  a1b
#> 3  c4d
#> 4  e9g
df %>% separate(x, c("A","B"), sep = "[0-9]")
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    c    d
#> 4    e    g

df <- data.frame(x = c("x", "x y", "x y z", NA))
df
#>       x
#> 1     x
#> 2   x y
#> 3 x y z
#> 4  <NA>
df %>% separate(x, c("a", "b"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#>      a    b
#> 1    x <NA>
#> 2    x    y
#> 3    x    y
#> 4 <NA> <NA>

Created on 2022-08-03 by the reprex package (v2.0.1)

nathaneastwood commented 2 years ago

Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.

nathaneastwood commented 2 years ago

I took a look into some of this, regarding extra = "merge". I think we could use the following to split up the strings:

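    # Maximum number of pieces per string: one per output column.
    n_max <- length(into)
    # Locate every separator match in each string.
    m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
    if (n_max > 0) {
      m <- lapply(m, function(x) {
        # Keep only the first n_max - 1 matches, so each string is cut into
        # at most n_max pieces; the last piece keeps any remaining separators.
        i <- seq_along(x) < n_max
        # Rebuild the match object with the attributes regmatches() expects.
        structure(
          x[i],
          match.length = attr(x, "match.length")[i],
          index.type = attr(x, "index.type"),
          useBytes = attr(x, "useBytes")
        )
      })
    }
    # invert = TRUE returns the text between the matches, i.e. the pieces.
    regmatches(as.character(data[[col]]), m, invert = TRUE)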

The problem is that this doesn't get rid of the "extra" information. With the columns a and b, the third row keeps "y z":

df <- data.frame(x = c("x", "x y", "x y z", NA))
#      a    b
# 1    x <NA>
# 2    x    y
# 3    x  y z
# 4 <NA> <NA>

Row 3 should be x y, with a warning. This differs from the approach you took, which uses strsplit().
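
For comparison, here is a minimal base R sketch of the strsplit()-style behaviour described above, where extra pieces are discarded with a warning. The helper name `split_drop_extra()` and its arguments are hypothetical, not taken from this PR or from poorman; the default `sep` is an assumption modelled on tidyr's default.

```r
# Hypothetical helper (not part of poorman): split each string on a regex,
# keep at most `n` pieces per row, and warn about rows that had more.
split_drop_extra <- function(x, n, sep = "[^[:alnum:]]+") {
  pieces <- strsplit(as.character(x), sep)
  too_many <- which(lengths(pieces) > n)
  if (length(too_many) > 0L) {
    warning(
      "Expected ", n, " pieces. Additional pieces discarded in ",
      length(too_many), " rows [", paste(too_many, collapse = ", "), "].",
      call. = FALSE
    )
  }
  # Indexing with seq_len(n) truncates long rows and pads short rows with NA.
  do.call(rbind, lapply(pieces, function(p) p[seq_len(n)]))
}

# Row 3 ("x y z") is cut down to "x" and "y", and a warning names row 3.
split_drop_extra(c("x", "x y", "x y z", NA), n = 2)
```

The result is a character matrix with one column per requested piece, which would still need to be bound back onto the data frame under the names given in `into`.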

nathaneastwood commented 2 years ago

Here is an example of what fill is supposed to do (taken from the tidyr tests):

r$> df                                                                 
# A tibble: 2 × 1
  x    
  <chr>
1 a b  
2 a b c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 NA    a     b    
2 a     b     c    

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")           
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c

r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")            
# A tibble: 2 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     NA   
2 a     b     c    
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
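
The padding shown in the tidyr output above could be sketched in base R roughly as follows. `pad_pieces()` is a hypothetical helper, not a function from tidyr or poorman. In this sketch "warn" pads on the right like "right" but does not emit the warning, since the warning needs the row indices of all short rows.

```r
# Hypothetical helper (not part of tidyr or poorman): pad one row's pieces
# with NA so that it has exactly `n` elements.
pad_pieces <- function(p, n, fill = c("warn", "right", "left")) {
  fill <- match.arg(fill)
  n_missing <- n - length(p)
  # Nothing to pad: return the first n pieces unchanged.
  if (n_missing <= 0L) return(p[seq_len(n)])
  pad <- rep(NA_character_, n_missing)
  # "left" puts the NAs first; "right" and "warn" put them last.
  if (fill == "left") c(pad, p) else c(p, pad)
}

pieces <- strsplit(c("a b", "a b c"), " ", fixed = TRUE)
do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "left"))
do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "right"))
```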

etiennebacher commented 1 year ago

Passing thought: it might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position() and separate_wider_regex(). separate() would then just call one of these, depending on the type of input.

nathaneastwood commented 1 year ago

I saw those. I may give them a miss. At some point I need to set a cut-off, and dplyr and tidyr 1.0.0 make sense to me.

etiennebacher commented 1 year ago

I understand that you can't cover all new things in dplyr and tidyr. What I meant is just that even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.

nathaneastwood commented 1 year ago

Ah I see what you mean. Yeah that seems like a good point, actually.
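
A rough sketch of the structure suggested above, with a thin dispatcher that picks a splitting helper based on the type of `sep`. All three function names are hypothetical and only illustrate the idea; they are not poorman's or tidyr's API. The positional helper assumes positive positions only.

```r
# Hypothetical internals; names are illustrative only.

# Split on a regular expression.
split_by_delim <- function(x, sep) {
  strsplit(as.character(x), sep)
}

# Split after the given character positions (positive positions only).
split_by_position <- function(x, pos) {
  x <- as.character(x)
  starts <- c(1L, pos + 1L)
  lapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    ends <- c(pos, nchar(s))
    substring(s, starts, ends)
  })
}

# Dispatcher: numeric `sep` means positions, character `sep` means a regex.
split_column <- function(x, sep = "[^[:alnum:]]+") {
  if (is.numeric(sep)) {
    split_by_position(x, as.integer(sep))
  } else if (is.character(sep)) {
    split_by_delim(x, sep)
  } else {
    stop("`sep` must be a character or numeric vector.", call. = FALSE)
  }
}

split_column(c("x.y", NA))         # regex split on the default separator
split_column(c("a1b", "c4d"), 1)   # split after the first character
```

With this layout, exporting the helpers later would mostly be a matter of documenting them, since separate() would already rely on them.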