multiple replacements - GitHub issues

karoliskoncevicius commented 4 years ago

This issue again :) But I think I have a proposal that might actually work, unless I am missing something.

In short, replacing multiple values at once would be a nice step, especially when:

numerics are replaced with characters

  x <- 1:9
  x %in[)% c(1,4) <- "low"
  x %in[)% c(4,7) <- "middle" # no longer a numeric vector...

regex replacements overlap with new checks

  x <- c("house", "home", "bus", "boat", "car")
  x %in~% c("^h") <- "building"
  x %in~% c("^b", "^c") <- "transportation" # now building will be replaced too

counts will change and overlap

  x <- c("a", "b", "b", "c", "c", "c")
  x %in#% 1:2 <- "rare"
  x %in#% 3:4 <- "common" # now rare will be replaced with common
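The regex case can be reproduced in base R alone. The following is only a minimal sketch with `grepl()`, not inops code: it contrasts sequential replacement with computing all matches on the original vector first.

```r
# Base R only; no inops operators are used here.
x <- c("house", "home", "bus", "boat", "car")

# Sequential replacement: the second pattern is matched against the
# already-modified vector, so "building" (which starts with "b") is hit too.
seq_x <- x
seq_x[grepl("^h", seq_x)] <- "building"
seq_x[grepl("^b|^c", seq_x)] <- "transportation"

# Computing every match on the original vector first avoids the overlap.
is_building  <- grepl("^h", x)
is_transport <- grepl("^b|^c", x)
once_x <- x
once_x[is_building]  <- "building"
once_x[is_transport] <- "transportation"

print(seq_x)
print(once_x)
```

In the sequential version every element ends up as "transportation", while the second version keeps "building" for the first two elements.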

To overcome these, a syntax that replaces multiple values at once would help. My suggestion, which I would like some feedback on, is to do this when the rhs is a list:

  x <- 1:9
  x %in[)% c(1,4,7,10) <- list("low", "medium", "high")

Of course it's a bit tricky to say how this would work with %out%. It might be non-intuitive there, but it would probably be included for consistency.

@moodymudskipper what do you think?

karoliskoncevicius commented 4 years ago

I think this is much-needed functionality. I was using this package for my work on the weekend and missed it at least a few times. Also one person on your twitter post (which was very nice by the way, we got a great reception!) left this:

Can replacement take vector values on right hand side?

Which in my mind was hinting at the same thing.

moodymudskipper commented 4 years ago

Is it normal to have 3 elements in value and 4 in range here ?

x %in[)% c(1,4,7,10) <- list("low", "medium", "high")

It seems to me you want to have the range argument mean something different depending on the class of the value argument is that right ?

If so I'm afraid this will be a big source of confusion, though I guess it could technically work.

In x %in~% c("^b", "^c") <- list("building", "transportation") you're mapping "^c" to "transportation" but they're not next to each other, we need to mentally count, that's cognitive load, + if they are stored in objects it needs to be 2 different objects. I try hard but I still don't get why you want so much to do this with infix operators.

For intervals I suggested this :

age := dplyr::case_when(
  .. < 18 ~ "kid/teen",
  .. < 28 ~ "young adult",
  TRUE ~ "adult")

Or

cut(age) <- c(0, "kid/teen" = 18, "young adult" = 28, "adult" = Inf)

But I really think case_when is the way to go for those. And it's not more verbose than your option.

I see potential value in a family of functions to recode based on sets, ranges, regex or counts, but not infix.

Here are some ideas for a recode family :

regex_recode(x, building = "^b", transportation = "^c")
range_recode(x, low = 1 <= ~. < 4, middle = 4 <= ~. < 7) # metaprogramming to validate format
count_recode(x, rare = 1:2, common = 3:4)
# in case of conflict first condition that matches wins
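To make the "first condition that matches wins" rule concrete, here is a rough base R sketch of what `regex_recode` might look like. It is an assumption for illustration only: no such function exists in inops, and the argument handling (named patterns passed through `...`) is invented.

```r
# Hypothetical helper; not part of inops or any released package.
regex_recode <- function(x, ...) {
  patterns <- list(...)                 # named: label = regex pattern(s)
  out  <- x
  done <- rep(FALSE, length(x))         # elements that were already recoded
  for (i in seq_along(patterns)) {
    # Always match against the original x, so new labels are never re-matched.
    hit <- !done & grepl(paste(patterns[[i]], collapse = "|"), x)
    out[hit] <- names(patterns)[i]
    done <- done | hit                  # earlier conditions take priority
  }
  out
}

x <- c("house", "home", "bus", "boat", "car")
res <- regex_recode(x, building = "^h", transportation = c("^b", "^c"))
print(res)
```

Because matching is always done on the original `x`, the label "building" is not picked up by the "^b" pattern.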

Maybe I still don't understand your use cases. Can you come up with a draft of one of those functions and a realistic use case that shows where it shines compared to my proposed alternatives ?

karoliskoncevicius commented 4 years ago

It seems to me you want to have the range argument mean something different depending on the class of the value argument is that right ?

The example was deliberately constructed to be consistent with %in[)%, which is why value has 3 elements. Currently %in[)% takes the whole range. The proposed syntax divides that range into 3 sub-ranges, and instead of assigning TRUE to all of them, it assigns a different value to each sub-range.

that's cognitive load, + if they are stored in objects it needs to be 2 different objects

Yes, very good point. If this got long enough, it would be a bit burdensome to track which range is related to which value.

I try hard but I still don't get why you want so much to do this with infix operators.

My main wish is precisely as you stated in the previous sentence - to reduce the cognitive load. I would much prefer to have one way of replacing working for both scenarios, not needing to remember different functions for all exceptions.

The problem I see currently, which I tried to describe in the first post too, is that when using the package I very rarely want to only replace one value with another. Multiple replacements seem to be much more common. Hence the replacement operators we have are not really useful for this, and it will almost always be better to do case_when() with %in[)% in its conditions.

But I really think case_when is the way to go for those. And it's not more verbose than your option.

I do think it's more verbose by a little bit - repeating variable and comparison operators:

age <- dplyr::case_when(
    age < 18 ~ "kid/teen",
    age < 28 ~ "young adult",
    age < 60 ~ "adult",
    TRUE ~ "senior")

age %in[)% c(-Inf, 18, 28, 60, Inf) <- c("kid/teen", "young adult", "adult", "senior")

But yes, case_when is more general and more readable in this case. So I am hesitant to offer this syntax now, after your comment. But I still do think that, given that nice syntax is possible, this functionality would be nice to have here.
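For comparison, the same four-group mapping can be written with base R's `cut()`, which also takes the breakpoints and labels as two parallel vectors. This is only a sketch with made-up ages, and unlike the proposed replacement it returns a factor rather than modifying `age` in place.

```r
# Illustrative ages chosen to sit on and around the interval bounds.
age <- c(5, 17, 18, 27, 28, 59, 60, 90)

# right = FALSE makes the intervals closed on the left, [a, b),
# which mirrors the %in[)% convention discussed in this thread.
age_group <- cut(
  age,
  breaks = c(-Inf, 18, 28, 60, Inf),
  labels = c("kid/teen", "young adult", "adult", "senior"),
  right  = FALSE
)

# cut() returns a factor; convert if a character vector is wanted.
age_chr <- as.character(age_group)
print(age_chr)
```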

moodymudskipper commented 4 years ago

I think I understand better now. I agree that no existing solution feels perfectly right, and that our package feels like it might offer a bridge to something nicer. In particular I don't know any approach that solves the awkwardness of the fact that we have one less interval than breakpoints.

I also like the idea of a generalized approach to recoding and cutting, incl regex support. I'm just not very comfortable with complexifying the existing functions, and I feel your proposed syntax is problematic for reasons given above.

This might need to be a new family of functions with different consistency rules, but by all means if you really feel that your proposed syntax can be intuitive and useful let's try it. case_when itself is not adapted to use with a long structured interval mapping either so it's also problematic in some respects.

karoliskoncevicius commented 4 years ago

I don't know any approach that solves the awkwardness of the fact that we have one less interval than breakpoints.

I think this is because you are used to those numbers being break points, while in this example they delimit intervals, since we might not want to replace the whole range of values, but only a subset. In particular:

x <- 1:10
x %in[)% c(3,7,9) <- c("3-6", "7-8")

If instead the whole range of x needs to be divided into categories:

x %in[)% c(-Inf, 3, 7, 9, Inf) <- c(0, 1, 2, 3)

Basically think "between the values" - to me it's more intuitive than break points for some reason.

It is also consistent with %in[)%, in that the only values replaced are those that %in[)% would mark as TRUE, while a break-point solution would not be consistent.
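The "between the values" semantics could be sketched in base R with `findInterval()`. The helper name `replace_in_ranges` and its signature are hypothetical, made up for this illustration and not part of inops. Elements outside the outer bounds are left untouched, matching what %in[)% would mark as FALSE.

```r
# Hypothetical helper; the name and signature are illustrative only.
replace_in_ranges <- function(x, bounds, values) {
  stopifnot(length(values) == length(bounds) - 1)
  # findInterval() returns i when bounds[i] <= x < bounds[i + 1],
  # 0 below the first bound and length(bounds) at or above the last one.
  idx <- findInterval(x, bounds)
  inside <- !is.na(idx) & idx >= 1 & idx < length(bounds)
  out <- x
  out[inside] <- values[idx[inside]]    # R coerces out if the types differ
  out
}

x <- 1:10
partial <- replace_in_ranges(x, c(3, 7, 9), c("3-6", "7-8"))
whole   <- replace_in_ranges(x, c(-Inf, 3, 7, 9, Inf), c(0, 1, 2, 3))
print(partial)
print(whole)
```

In the first call only 3 to 8 are replaced and the rest of `x` is kept (coerced to character). In the second call the infinite outer bounds cover every element.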

Regarding the syntax - I agree, will think more about it, see if something "pops up" to mind.

moodymudskipper / inops

multiple replacements #43