Closed moodymudskipper closed 2 years ago
library(safejoin)
library(dplyr, warn.conflicts = FALSE)
#> Warning: le package 'dplyr' a été compilé avec la version R 3.5.2
d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),Height=c(178,169,192),Grade=c(15,12,18))
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"))
d1
#> Id Name Age Height Grade
#> 1 1 Yann 21 178 15
#> 2 2 Anne 19 169 12
#> 3 3 Sabri 31 192 18
d2
#> Id Name Age Sex City
#> 1 1 Yann 28 M Paris
#> 2 3 Sabri 21 M Paris
#> 3 4 Jui 15 F Toulouse
coalesce(factor(c(NA,"a")),factor(c("b",NA)))
#> Warning in `[<-.factor`(`*tmp*`, i, value = structure(1L, .Label = "b",
#> class = "factor")): niveau de facteur incorrect, NAs générés
#> [1] <NA> a
#> Levels: a
# doesn't work
safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id",
conflict = coalesce)
#> Warning in `[<-.factor`(`*tmp*`, i, value = structure(1L, .Label =
#> c("Jui", : niveau de facteur incorrect, NAs générés
#> Id Name Age Height Grade Sex
#> 1 1 Yann 21 178 15 M
#> 2 2 Anne 19 169 12 <NA>
#> 3 3 Sabri 31 192 18 M
#> 4 4 <NA> 15 NA NA F
# converts to character
safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id",
conflict = ~coalesce(as.character(.x),as.character(.y)))
#> Id Name Age Height Grade Sex
#> 1 1 Yann 21 178 15 M
#> 2 2 Anne 19 169 12 <NA>
#> 3 3 Sabri 31 192 18 M
#> 4 4 Jui 15 NA NA F
# converts to factor with extra levels
res <- safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id",
conflict = ~{
res <- coalesce(as.character(.x),as.character(.y))
if(is.factor(.x) && is.factor(.y)){
res <- factor(res, union(levels(.x), levels(.y)))
}
res
})
res
#> Id Name Age Height Grade Sex
#> 1 1 Yann 21 178 15 M
#> 2 2 Anne 19 169 12 <NA>
#> 3 3 Sabri 31 192 18 M
#> 4 4 Jui 15 NA NA F
res$Name
#> [1] Yann Anne Sabri Jui
#> Levels: Anne Sabri Yann Jui
Created on 2019-02-25 by the reprex package (v0.2.0).
We could have "nest" which would nest the 2 conflicting colums with names "x" and "y" under the original name.
"coalesce_xy"
is a bit strange because it's 5 characters longer than coalesce
and does the same with most of the time, but it is safe. It shows consistency issues though. When grouping columns are char and fact or fact with diff levels dplyr coerces to char with warning, we should expand the levels in the latter case, but then we're not dplyr by default, which I think is ok
need to mature a bit more,
nest
can be obtained with .conflict = ~tibble(.x=.x, .y=.y)
coalesce
as it is doesn't even permit to coalesce integers with numerics (1:3
is integer, rep(1,3)
is numeric..), so this feature wouldn't be so bad, the conversions are very verbose.
"add"
as an alias to +
is not so useful but "add_na_rm"
is, as a shortcut to ~rowSums(data.frame(.x, .y), na.rm = TRUE)
Have substract by symetry, there won't be as many use cases but there was one on SO.
The documented way of nesting works simply enough with tibbles but :
having "nest" we could have default names, maybe work something out for list inputs (column nests more and more subcolumns as we go), as it would be a special case it doesn't need to be just a shortcut, and could name the columns appropriately.
Hello, You're package is awesome ! I would like to know if it would be possible to put text in case of conflict? I just want to identify conflicts by putting "CONFLICTS" inside the cell. Do you know whether it is possible? Thank you ! Raphael
Hi, @RaphaelPaut , and thanks for the kind words, I appreciate!
I would think that the answer is yes, but I'm not 100% sure I get what output you expect, do you want a constant column of "CONFLICTS" for conflicted variables, or do you want "CONFLICTS" when you have several non NA values to choose from ?
The easier would be that you provide a minimal reproducible example (a couple of small data.frame to join, the function you want to apply and the result you expect).
. Hi @moodymudskipper , thank you for this quick answer ! Actually, it is the second case, I would like to print "conflict" when I have several non NA values to choose from.
For example : d1
d2
The result would be something like this:
Another request, is it possible to make an operation such as an addition between several non NA values to choose from?
Thank you !
Raphaël
There you go Raphaël :
library(safejoin)
d1 <- read.table(text="
Id Name Age
1 1 Yann 21
2 2 Anne 19
3 3 Sabri 31")
d2 <- read.table(text="
Id Name Age
1 1 Yann 28
2 2 Anne 19
3 3 Sabri 31")
safe_inner_join(d1, d2, by = c("Id", "Name"), conflict = ~ifelse(.x == .y, .x, "conflict"),)
#> Id Name Age
#> 1 1 Yann conflict
#> 2 2 Anne 19
#> 3 3 Sabri 31
d3 <- data.frame(a=1:3, b = c(3:4,NA))
d4 <- data.frame(a=1:3, b = c(5,NA,NA))
safe_inner_join(d3, d4, by = "a", conflict = ~rowSums(cbind(.x,.y), na.rm = TRUE))
#> a b
#> 1 1 8
#> 2 2 4
#> 3 3 0
Created on 2019-07-23 by the reprex package (v0.3.0)
The conflict
argument is a function that take as arguments each pair of conflicted columns, feeding the column from the x table to the first argument and the column from the second table to the second. Therefore ifelse
is useful in this context.
Note that in the first case your age column became character, you might want something like NaN
instead of "conflict"
so the column can stay numeric.
Hi ! I seems perfect ! I will try this out and let you know ! Raphaël
The fun now happens at https://github.com/moodymudskipper/powerjoin
"patch" is not really a shortcut but we could go back to having :
coalesce
except that it would handle factors right, by adding levels, and we don't have to attach dplyr or type "dplyr::"We have a check to impose that factors have the same level, so we can afford to do this as user will have opted in, we can still show messages or warning according to the choice of "l" / "~l" / "" though.