moodymudskipper / safejoin

Wrappers around dplyr functions to join safely using various checks
GNU General Public License v3.0
42 stars 7 forks source link

have more shortcuts in conflict argument ? #13

Closed moodymudskipper closed 2 years ago

moodymudskipper commented 5 years ago

"patch" is not really a shortcut but we could go back to having :

We have a check to impose that factors have the same level, so we can afford to do this as user will have opted in, we can still show messages or warning according to the choice of "l" / "~l" / "" though.

moodymudskipper commented 5 years ago
library(safejoin)
library(dplyr, warn.conflicts = FALSE)
#> Warning: le package 'dplyr' a été compilé avec la version R 3.5.2
d1 <- data.frame(Id=1:3,Name=c("Yann","Anne","Sabri"),Age=c(21,19,31),Height=c(178,169,192),Grade=c(15,12,18))
d2 <- data.frame(Id=c(1,3,4),Name=c("Yann","Sabri","Jui"),Age=c(28,21,15),Sex=c("M","M","F"),City=c("Paris","Paris","Toulouse"))
d1
#>   Id  Name Age Height Grade
#> 1  1  Yann  21    178    15
#> 2  2  Anne  19    169    12
#> 3  3 Sabri  31    192    18
d2
#>   Id  Name Age Sex     City
#> 1  1  Yann  28   M    Paris
#> 2  3 Sabri  21   M    Paris
#> 3  4   Jui  15   F Toulouse

coalesce(factor(c(NA,"a")),factor(c("b",NA)))
#> Warning in `[<-.factor`(`*tmp*`, i, value = structure(1L, .Label = "b",
#> class = "factor")): niveau de facteur incorrect, NAs générés
#> [1] <NA> a   
#> Levels: a

# doesn't work
safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id", 
               conflict = coalesce)
#> Warning in `[<-.factor`(`*tmp*`, i, value = structure(1L, .Label =
#> c("Jui", : niveau de facteur incorrect, NAs générés
#>   Id  Name Age Height Grade  Sex
#> 1  1  Yann  21    178    15    M
#> 2  2  Anne  19    169    12 <NA>
#> 3  3 Sabri  31    192    18    M
#> 4  4  <NA>  15     NA    NA    F

# converts to character
safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id", 
               conflict = ~coalesce(as.character(.x),as.character(.y)))
#>   Id  Name Age Height Grade  Sex
#> 1  1  Yann  21    178    15    M
#> 2  2  Anne  19    169    12 <NA>
#> 3  3 Sabri  31    192    18    M
#> 4  4   Jui  15     NA    NA    F

# converts to factor with extra levels
res <- safe_full_join(d1, select(d2, Id, Name, Age, Sex), by = "Id", 
               conflict = ~{
                 res <- coalesce(as.character(.x),as.character(.y))
                 if(is.factor(.x) && is.factor(.y)){
                   res <- factor(res, union(levels(.x), levels(.y)))
                 } 
                 res
                 })
res
#>   Id  Name Age Height Grade  Sex
#> 1  1  Yann  21    178    15    M
#> 2  2  Anne  19    169    12 <NA>
#> 3  3 Sabri  31    192    18    M
#> 4  4   Jui  15     NA    NA    F
res$Name
#> [1] Yann  Anne  Sabri Jui  
#> Levels: Anne Sabri Yann Jui

Created on 2019-02-25 by the reprex package (v0.2.0).

moodymudskipper commented 5 years ago

We could have "nest" which would nest the 2 conflicting colums with names "x" and "y" under the original name.

"coalesce_xy" is a bit strange because it's 5 characters longer than coalesce and does the same with most of the time, but it is safe. It shows consistency issues though. When grouping columns are char and fact or fact with diff levels dplyr coerces to char with warning, we should expand the levels in the latter case, but then we're not dplyr by default, which I think is ok

moodymudskipper commented 5 years ago

need to mature a bit more,

nest can be obtained with .conflict = ~tibble(.x=.x, .y=.y)

moodymudskipper commented 5 years ago

coalesce as it is doesn't even permit to coalesce integers with numerics (1:3 is integer, rep(1,3) is numeric..), so this feature wouldn't be so bad, the conversions are very verbose.

"add" as an alias to + is not so useful but "add_na_rm" is, as a shortcut to ~rowSums(data.frame(.x, .y), na.rm = TRUE)

Have substract by symetry, there won't be as many use cases but there was one on SO.

The documented way of nesting works simply enough with tibbles but :

having "nest" we could have default names, maybe work something out for list inputs (column nests more and more subcolumns as we go), as it would be a special case it doesn't need to be just a shortcut, and could name the columns appropriately.

RaphaelPaut commented 5 years ago

Hello, You're package is awesome ! I would like to know if it would be possible to put text in case of conflict? I just want to identify conflicts by putting "CONFLICTS" inside the cell. Do you know whether it is possible? Thank you ! Raphael

moodymudskipper commented 5 years ago

Hi, @RaphaelPaut , and thanks for the kind words, I appreciate!

I would think that the answer is yes, but I'm not 100% sure I get what output you expect, do you want a constant column of "CONFLICTS" for conflicted variables, or do you want "CONFLICTS" when you have several non NA values to choose from ?

The easier would be that you provide a minimal reproducible example (a couple of small data.frame to join, the function you want to apply and the result you expect).

RaphaelPaut commented 5 years ago

. Hi @moodymudskipper , thank you for this quick answer ! Actually, it is the second case, I would like to print "conflict" when I have several non NA values to choose from.

For example : d1

> Id Name Age

> 1 1 Yann 21

> 2 2 Anne 19

> 3 3 Sabri 31

d2

> Id Name Age

> 1 1 Yann 28

> 2 2 Anne 19

> 3 3 Sabri 31

The result would be something like this:

> Id Name Age

> 1 1 Yann "Conflict"

> 2 2 Anne 19

> 3 3 Sabri 31

Another request, is it possible to make an operation such as an addition between several non NA values to choose from?

Thank you !

Raphaël

moodymudskipper commented 5 years ago

There you go Raphaël :

library(safejoin)

d1 <- read.table(text="
Id Name Age
1 1 Yann 21
2 2 Anne 19
3 3 Sabri 31")

d2 <- read.table(text="
Id Name Age
1 1 Yann 28
2 2 Anne 19
3 3 Sabri 31")

safe_inner_join(d1, d2, by = c("Id", "Name"), conflict = ~ifelse(.x == .y, .x, "conflict"),)
#>   Id  Name      Age
#> 1  1  Yann conflict
#> 2  2  Anne       19
#> 3  3 Sabri       31

d3 <- data.frame(a=1:3, b = c(3:4,NA))
d4 <- data.frame(a=1:3, b = c(5,NA,NA))
safe_inner_join(d3, d4, by = "a", conflict = ~rowSums(cbind(.x,.y), na.rm = TRUE))
#>   a b
#> 1 1 8
#> 2 2 4
#> 3 3 0

Created on 2019-07-23 by the reprex package (v0.3.0)

The conflict argument is a function that take as arguments each pair of conflicted columns, feeding the column from the x table to the first argument and the column from the second table to the second. Therefore ifelse is useful in this context.

Note that in the first case your age column became character, you might want something like NaN instead of "conflict" so the column can stay numeric.

RaphaelPaut commented 5 years ago

Hi ! I seems perfect ! I will try this out and let you know ! Raphaël

moodymudskipper commented 2 years ago

The fun now happens at https://github.com/moodymudskipper/powerjoin