randrescastaneda / joyn

joyn provides a set of tools to analyze the quality of merging (i.e., joining) data frames. It is a JOY to join with joyn
https://randrescastaneda.github.io/joyn/
Other
7 stars 2 forks source link

Just a question regarding usage of collapse::join #58

Closed SebKrantz closed 3 months ago

SebKrantz commented 5 months ago

Hi @randrescastaneda. I just looked at your workhorse function, and wanted to ask what you are missing in collapse for the m:m join. As far as I can tell, with multiple = TRUE, collapse::join does implement a full cartesian product, where each row in x is matched with each corresponding row in y. E.g.

library(collapse)
#> Warning: package 'collapse' was built under R version 4.3.1
#> collapse 2.0.13, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
join(data.frame(id = c(rep(1, 3), 2)), 
     data.frame(id = c(3, rep(1, 3))), multiple = TRUE, how = "full", 
     column = list(".join", c("x", "y", "x & y")))
#> full join: x[id] 9/10 (90%) <m:m> y[id] 3/4 (75%)
#>    id .join
#> 1   1 x & y
#> 2   1 x & y
#> 3   1 x & y
#> 4   1 x & y
#> 5   1 x & y
#> 6   1 x & y
#> 7   1 x & y
#> 8   1 x & y
#> 9   1 x & y
#> 10  2     x
#> 11  3     y

Created on 2024-04-16 with reprex v2.0.2

randrescastaneda commented 5 months ago

Hi @SebKrantz,

Thanks for the question! We really appreciate your interest in joyn.

We think your work with collapse is fantastic and this is why we use collapse::join as the heavy lifter of joyn. Yet, It's possible the multiple argument wasn't fully functional when we first implemented it or it was not developed at that time... I don't remember.

@zander-prinsloo, any insights on why we opted for a different approach in joyn's m:m matching type? Your testing and benchmarking of data.table and collapse performance were invaluable. Perhaps revisiting the multiple argument in joyn might be worthwhile.

Could you take a look and share your thoughts with Sebastian?

Thanks.

SebKrantz commented 5 months ago

Just a few further notes: Please don't use set_collapse() in a package, as these are global options that will also affect the behavior of collapse outside of joyn. %in% is defined as:

"%in%" <- function(x, table) as.logical(fmatch(x, table, 0L, overid = 2L))

Regarding memory efficiency, data.table::copy() does a full deep copy of the frame, which is usually never necessary when dealing with data.frames. Also as.data.table() is very expensive. I'd suggest to use dt_copy = qDT(dt), or, if you want to use data.table's own conversion tools: dt_copy = setDT(qDF(dt)). Both qDF() and qDT() create a shallow copy, i.e., they copy the outer part of the frame but not the vectors storing the data. In both cases, you can create new columns using :=, but if you modify existing columns it will also affect the original frame (in that case you indeed need to copy(), but deep copying the whole frame is not worth the benefits of :=, so I'd use fmutate/ftransform for that).

More on topic again. I'm thinking you could probably benefit from specifying attr = TRUE in collapse::join and use some of that information to compute some of the statistics more efficiently. Practically all statistics (if verbose > 0) and checks (if validate != "m:m") in collapse::join are based on the output of fmatch(..., count = TRUE), see here and here. Currently, count = TRUE is only invoked if verbose > 0 || validate != "m:m", but I could alter this insofar that it is also invoked if !is.null(attr) (that would make sense actually, I've changed it already in the development version). Even with the current version, you could ensure that the attributes are attached by keeping verbose = 1L and using a small hack to suppress the output: invisible(capture.output(res <- join(..., verbose = 1L, attr = TRUE))).

Anyway, just giving you my five cents here if it is useful. I like this work because it goes even further than what I, having switched to R from STATA as well long time ago, intended to accomplish with collapse::join. Personally, I am satisfied with the functionality and more parsimonious verbosity of collapse::join, but I am convinced many economists will appreciate having a join function that works 'exactly' like STATA.

zander-prinsloo commented 5 months ago

Hi @SebKrantz , thanks for these comments - super useful! And, as you can see, we are really using collapse a lot and really appreciate your work. I'm glad you

On the m:m join, yes we implemented the data.table cartesian product merge last year before the update to collapse::join(). We would definitely like to use collapse as far as possible so will make the change in the next release. Thanks for that!

On the use of set_collapse(), thanks for bringing that up. I will prioritise modifying that. We are aware of the inefficiency regarding data.table::copy() and just wanted to get this version out. Am currently working on removing it with the goal of depending as much on collapse as possible. @RossanaTat take a look here as well, and we can discuss.

Thanks again, will be in touch.

SebKrantz commented 4 months ago

Just noting that the change with the attr = TRUE argument always providing full information is part of 2.0.14 which is now GitHub and R-universe released. CRAN probably in 2 weeks. The function has also become a bit smarter in the (not so uncommon) case that two datasets already perfectly match.

SebKrantz commented 3 months ago

@zander-prinsloo and @RossanaTat, I've had a look at the code but found it too complicated to propose concrete edits :). However, following up on my comment, I've come up with a barebones joyn function that efficiently wraps collapse::join (collapse versions >= 2.0.14). Hope it is useful in case you plan to improve performance, and of course happy to provide further explanations.

options(fastverse.styling = FALSE)
library(fastverse)
#> -- Attaching packages --------------------------------------- fastverse 0.3.3 --
#> Warning: package 'data.table' was built under R version 4.3.1
#> v data.table 1.15.4     v kit        0.0.17
#> v magrittr   2.0.3      v collapse   2.0.15
fastverse_extend(joyn)
#> -- Attaching extension packages ----------------------------- fastverse 0.3.3 --
#> v joyn 0.2.0.9004
#> -- Conflicts ------------------------------------------ fastverse_conflicts() --
#> x joyn::merge() masks base::merge()

joyn_light <- function(x, y, by = NULL) {

  res <- join(x, y, by,
              column = list(".joyn", c("x", "y", "x & y")), # factor variable is more efficient than character. Can use unattrib() to get integer version
              verbose = 0,
              how = "full",
              multiple = TRUE,
              drop.dup.cols = "y",
              overid = 2L, # If you want to silence the overidentification warning, passed to fmatch()
              attr = TRUE)

  m <- attr(res, "join.match")$match
  attr(res, "join.match") <- NULL

  N <- fnrow(res)
  nm_x <- attr(m, "N.nomatch") # Number of non-matched x values
  nm_y <- fnrow(y) - attr(m, "N.distinct") # Number of non-matched y values. If multiple = FALSE attr(m, "N.distinct") = number of unique matches. 

  counts <- c(nm_x, N-nm_x-nm_y, nm_y, N)
  report <- data.table(
    .joyn = c("x", "x & y", "y", "total"),
    n = counts,
    percent = paste0(round(counts / N * 100, 1), "%")
  ) |> fsubset(n > 0)

  return(list(report = report, result = res))
}

x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)

y1 = data.table(id = 1:2,
                y  = c(11L, 15L))

x2 = data.table(id = c(1, 1, 2, 3, NA),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = c(16, 12, NA, NA, 15))

y2 = data.table(id = c(1, 2, 5, 6, 3),
                yd = c(1, 2, 5, 6, 3),
                y  = c(11L, 15L, 20L, 13L, 10L),
                x  = c(16:20))

joyn(x1, y1, match_type = "m:1")
#> 
#> ── JOYn Report ──
#> 
#>     .joyn     n percent
#>    <char> <int>  <char>
#> 1:      x     2     40%
#> 2:  x & y     3     60%
#> 3:  total     5    100%
#> ────────────────────────────────────────────────────────── End of JOYn report ──
#> ℹ ❯ Joyn's report available in variable .joyn
#> ℹ ❯ Removing key variables id from id and y
#>       id     t     x     y  .joyn
#>    <int> <int> <int> <int> <char>
#> 1:     1     1    11    11  x & y
#> 2:     1     2    12    11  x & y
#> 3:     2     1    13    15  x & y
#> 4:     3     2    14    NA      x
#> 5:    NA    NA    15    NA      x
joyn_light(x1, y1)
#> $report
#>     .joyn     n percent
#>    <char> <int>  <char>
#> 1:      x     2     40%
#> 2:  x & y     3     60%
#> 3:  total     5    100%
#> 
#> $result
#>       id     t     x     y  .joyn
#>    <int> <int> <int> <int> <fctr>
#> 1:     1     1    11    11  x & y
#> 2:     1     2    12    11  x & y
#> 3:     2     1    13    15  x & y
#> 4:     3     2    14    NA      x
#> 5:    NA    NA    15    NA      x

joyn(x2, y2, by = "id", match_type = "m:1")
#> 
#> ── JOYn Report ──
#> 
#>     .joyn     n percent
#>    <char> <int>  <char>
#> 1:      x     1   14.3%
#> 2:  x & y     4   57.1%
#> 3:      y     2   28.6%
#> 4:  total     7    100%
#> ────────────────────────────────────────────────────────── End of JOYn report ──
#> ℹ ❯ Joyn's report available in variable .joyn
#> ℹ ❯ Removing key variables id from id, yd, y, and x
#>       id     t     x    yd     y  .joyn
#>    <num> <int> <num> <num> <int> <char>
#> 1:     1     1    16     1    11  x & y
#> 2:     1     2    12     1    11  x & y
#> 3:     2     1    NA     2    15  x & y
#> 4:     3     2    NA     3    10  x & y
#> 5:    NA    NA    15    NA    NA      x
#> 6:     5    NA    NA     5    20      y
#> 7:     6    NA    NA     6    13      y
joyn_light(x2, y2, by = "id")
#> $report
#>     .joyn     n percent
#>    <char> <int>  <char>
#> 1:      x     1   14.3%
#> 2:  x & y     4   57.1%
#> 3:      y     2   28.6%
#> 4:  total     7    100%
#> 
#> $result
#>       id     t     x    yd     y  .joyn
#>    <num> <int> <num> <num> <int> <fctr>
#> 1:     1     1    16     1    11  x & y
#> 2:     1     2    12     1    11  x & y
#> 3:     2     1    NA     2    15  x & y
#> 4:     3     2    NA     3    10  x & y
#> 5:    NA    NA    15    NA    NA      x
#> 6:     5    NA    NA     5    20      y
#> 7:     6    NA    NA     6    13      y

Created on 2024-05-31 with reprex v2.0.2

randrescastaneda commented 3 months ago

Hi @SebKrantz ,

Thank you so much for the suggestion. We highly appreciate it. We will work on it and get back to you.

Best, R.Andres