Open etiennebacher opened 2 years ago
Thanks, this looks nice. I'm going away until the end of the week, starting from tonight. I'll try to look properly when I'm back.
I took a look into some of this re extra = "merge". I think we could use the following to split up the strings:
n_max <- length(into)
m <- gregexpr(sep, as.character(data[[col]]), perl = TRUE)
if (n_max > 0) {
  # Keep only the first n_max - 1 separator matches per string
  m <- lapply(m, function(x) {
    i <- seq_along(x) < n_max
    structure(
      x[i],
      match.length = attr(x, "match.length")[i],
      index.type = attr(x, "index.type"),
      useBytes = attr(x, "useBytes")
    )
  })
}
regmatches(as.character(data[[col]]), m, invert = TRUE)
The problem is this doesn't get rid of "extra" information.
df <- data.frame(x = c("x", "x y", "x y z", NA))
# a b
# 1 x <NA>
# 2 x y
# 3 x y z
# 4 <NA> <NA>
Row 3 should be x y with a warning. This is different to the approach you took, which is using strsplit().
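For comparison, here is a minimal base R sketch of the strsplit() route that drops the extra pieces and warns. The helper name split_drop_extra() and its arguments are hypothetical, not part of this PR or of tidyr, and the warning text is only modelled on the tidyr-style messages quoted in this thread.

```r
# Hypothetical helper (not from this PR): split each string on `sep`,
# keep at most `n` pieces per row, and warn about rows that had more.
split_drop_extra <- function(x, n, sep = " ") {
  pieces <- strsplit(as.character(x), sep, perl = TRUE)
  too_many <- which(lengths(pieces) > n)
  if (length(too_many) > 0L) {
    warning(
      "Expected ", n, " pieces. Additional pieces discarded in ",
      length(too_many), " rows [", paste(too_many, collapse = ", "), "]."
    )
  }
  # Indexing with seq_len(n) truncates long rows and pads short rows
  # (including NA inputs) with NA.
  do.call(rbind, lapply(pieces, function(p) p[seq_len(n)]))
}

df <- data.frame(x = c("x", "x y", "x y z", NA))
res <- suppressWarnings(split_drop_extra(df$x, n = 2))
```

With this approach row 3 comes out as x y rather than x "y z", at the cost of needing a separate code path for extra = "merge".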
Here is an example of what fill is supposed to do (taken from the tidyr tests):
r$> df
# A tibble: 2 × 1
x
<chr>
1 a b
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "left")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 NA a b
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "right")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 a b NA
2 a b c
r$> tidyr::separate(df, x, c("x", "y", "z"), fill = "warn")
# A tibble: 2 × 3
x y z
<chr> <chr> <chr>
1 a b NA
2 a b c
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
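The padding behaviour shown above could be sketched in base R roughly as follows. The helper pad_pieces() is a hypothetical name used only for illustration, and the sketch leaves out the fill = "warn" case, which pads on the right like fill = "right" and additionally emits the warning.

```r
# Hypothetical helper: pad a character vector of pieces to length `n`
# with NA, either on the left or on the right.
pad_pieces <- function(p, n, fill = c("right", "left")) {
  fill <- match.arg(fill)
  padding <- rep(NA_character_, max(n - length(p), 0L))
  if (fill == "left") c(padding, p) else c(p, padding)
}

x <- c("a b", "a b c")
pieces <- strsplit(x, " ", fixed = TRUE)

# One row per input string, one column per piece.
left  <- do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "left"))
right <- do.call(rbind, lapply(pieces, pad_pieces, n = 3, fill = "right"))
```

Here left has NA a b in its first row and right has a b NA, matching the two tidyr outputs.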
Passing thought: might be worth implementing tidyr's new functions separate_wider_delim(), separate_wider_position(), separate_wider_regex(). separate() would then only call one of these depending on the type of input.
I saw those. I may give them a miss. At some point I need to draw a cut-off, and dplyr and tidyr 1.0.0 make sense to me.
I understand that you can't cover all new things in dplyr and tidyr. What I meant is just that even from the developer's point of view, it might be easier/cleaner to create these 3 functions separately and then call them in separate(). And then, since those functions will exist, it won't cost much to export them.
Ah I see what you mean. Yeah that seems like a good point, actually.
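A rough sketch of that dispatch idea, assuming separate() receives sep either as a regex (character) or as split positions (numeric). All three function names below are hypothetical stand-ins, not the tidyr functions and not code from this PR, and negative positions are not handled.

```r
# Hypothetical internal helpers; names and signatures are illustrative only.
split_by_regex <- function(x, sep) {
  strsplit(as.character(x), sep, perl = TRUE)
}

split_by_position <- function(x, sep) {
  x <- as.character(x)
  # Each piece starts right after the previous cut position.
  starts <- c(1L, sep + 1L)
  lapply(x, function(s) substring(s, starts, c(sep, nchar(s))))
}

# The user-facing function only decides which helper to call.
separate_pieces <- function(x, sep) {
  if (is.numeric(sep)) {
    split_by_position(x, sep)
  } else if (is.character(sep)) {
    split_by_regex(x, sep)
  } else {
    stop("`sep` must be a character or numeric vector.")
  }
}

by_regex    <- separate_pieces("2022-08", "-")
by_position <- separate_pieces("202208", 4)
```

Both calls return the pieces "2022" and "08"; once the helpers exist on their own, exporting them is mostly a matter of documentation.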
This PR implements separate() to split a column into several columns, either based on a regex or on location.
@nathaneastwood this PR is not complete. I put it here as a draft so that it is saved somewhere and so that you can help with the TODO list if you have some time.
TODO:
- extra = "merge" (1 test failing so far)
- fill (the way this argument works is not very clear to me)

Some examples:
Created on 2022-08-03 by the reprex package (v2.0.1)