fix build against dev dplyr

kriemo commented 4 years ago

> 
> ### ** Examples
> 
> genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
> 
> x <- bed_random(genome, seed = 1010486)
> y <- bed_random(genome, seed = 9203911)
> 
> bed_absdist(x, y, genome)
Error: Join columns must be unique
✖ Problem at position 2
Backtrace:
    █
 1. └─valr::bed_absdist(x, y, genome)
 2.   ├─dplyr::inner_join(genome, ref_points, by = c("chrom", groups_xy))
 3.   └─dplyr:::inner_join.data.frame(...)
 4.     └─dplyr:::join_mutate(...)
 5.       └─dplyr:::join_cols(...)
 6.         └─dplyr:::standardise_join_by(by, x_names = x_names, y_names = y_names)
 7.           └─dplyr:::check_join_vars(by$x, x_names)
Execution halted

checking tests ...

 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 424 | SKIPPED: 3 | WARNINGS: 2 | FAILED: 15 ]
  1. Error: absdist calculation is correct (@test_absdist.r#22) 
  2. Error: self absdist is 0 (@test_absdist.r#35) 
  3. Error: x ivls without matching y-ivls chroms are reported with absdist = NA (@test_absdist.r#56) 
  4. Error: ensure that absdist is calculated with respect to input tbls issue#108 (@test_absdist.r#87) 
  5. Error: old dataframe groupings (dplyr v. < 0.7.9.900) are tolerated (@test_groups.r#87) 
  6. Failure: unmatched groups are included when invert = TRUE (@test_intersect.r#329) 
  7. Failure: book-ended intervals are not reported (@test_map.r#104) 
  8. Failure: basic partition works (bedops partition1 test) (@test_partition.r#36) 
  9. Failure: extended partition works (bedops partition2 test) (@test_partition.r#119) 
  1. ...

  Error: testthat unit tests failed
  Execution halted

kriemo commented 4 years ago

Most of the errors are related to testthat:test_equal no longer considering a trbl_interval() object equal to a tibble::tribble(), due to differing class attributes. Substituting testthat::expect_equivalent() gets around this issue. The alternative fix is mentioned in #274, but that would be a big breaking change that I am not in favor of at this point.

library(testthat)
library(valr)
x <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 2000,
  "chr1", 1000, 400
)

pred <- tibble::tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 1000, 400,
  "chr1", 1000, 2000
)

res <- bed_sort(x)
expect_equal(res, pred)
#> Error: `res` not equal to `pred`.
#> Attributes: < Length mismatch: comparison on first 2 components >
#> Attributes: < Component "class": Lengths (4, 3) differ (string compare on first 3) >
#> Attributes: < Component "class": 3 string mismatches >
class(res)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"
class(pred)
#> [1] "tbl_df"     "tbl"        "data.frame"
pred == res
#>      chrom start  end
#> [1,]  TRUE  TRUE TRUE
#> [2,]  TRUE  TRUE TRUE
expect_equivalent(res, pred)

^{Created on 2020-03-20 by the reprex package (v0.3.0)}

kriemo commented 4 years ago

arrange has a performance regression (https://github.com/tidyverse/dplyr/issues/4962) in the dev version of dplyr which will slow down many of valr's functions. The regression will likely be fixed before being released as v1.0.0. However testing on my end suggests that just using base R order with the radix sorting method is faster than CRAN dplyr, (or dev dplyr) by ~25-40%. I suggest we modify bed_sort to use base R unless arrange is fundamentally rewritten in v1.0.0.

Dev dplyr performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.99.9002'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 6.6252349 6.6252349
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2678227 0.2678227
#>       mean    median       uq      max neval cld
#>  6.6921549 6.6921549 6.759075 6.759075     2   b
#>  0.3317684 0.3317684 0.395714 0.395714     2  a

^{Created on 2020-03-21 by the reprex package (v0.3.0)}

v 0.8.5 performance:

library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)

packageVersion("dplyr")
#> [1] '0.8.5'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                   expr       min        lq
#>                          arrange(x, chrom, start, end) 0.4096381 0.4096381
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2582281 0.2582281
#>       mean    median        uq       max neval cld
#>  0.4190501 0.4190501 0.4284621 0.4284621     2   b
#>  0.2600742 0.2600742 0.2619203 0.2619203     2  a

^{Created on 2020-03-21 by the reprex package (v0.3.0)}

kriemo commented 4 years ago

There is also a performance hit with summarize that will not be addressed in v1.0.0. This is really noticeable when using n() within summary functions (e.g. bed_map or bed_merge). We can substitute length() to regain some of the performance losses. The benchmarks vignette should be updated to use bed_map(x, y, .n = length(end)) instead of calling n().

https://github.com/tidyverse/dplyr/issues/5017

library(valr)
library(bench)
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))

# number of intervals
n <- 1e6
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 1010487
y <- bed_random(genome, n = n, seed = seed_y)

mark(bed_map(x, y, .n = n()),
     bed_map(x, y, .n = length(end)),
     iterations = 2)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 bed_map(x, y, .n = n())             6.6s    7.06s     0.142     449MB     6.30
#> 2 bed_map(x, y, .n = length(end))    1.79s    1.85s     0.539     461MB     4.85

^{Created on 2020-03-22 by the reprex package (v0.3.0)}

kriemo commented 4 years ago

New upstream changes are causing multiple test errors due to our custom tbl_ivl and tbl_gnm classes not playing well with bind_rows() . The changes don't seem to be stable yet, so will hold off on trying to fix.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Error: No common type for `..1` <tbl_df<
#>   chrom: character
#>   start: double
#>   end  : double
#> >> and `..2` <tbl_ivl<
#>   chrom: character
#>   start: double
#>   end  : double
#> >>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

^{Created on 2020-04-02 by the reprex package (v0.3.0)}

kriemo commented 4 years ago

Output from newest vctrs. We will need to define some custom functions for vctrs (see https://github.com/r-lib/vctrs/issues/982) in order to keep our tbl_ivl and tbl_gnm tibble subclasses compatible with dplyr.

library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)

x <- tribble(
  ~ chrom, ~ start, ~ end,
  "chr1", 100, 200
)

y <- as.tbl_interval(x)

bind_rows(x, y)
#> Warning: Can't combine <tbl_df> and <tbl_ivl>.
#> ℹ Convert all inputs to the same class to avoid this warning.
#> ℹ See <https://vctrs.r-lib.org/reference/faq-warning-convert-inputs.html>.
#> ℹ Falling back to <data.frame>.
#> Error: Can't convert <tbl_ivl> to <data.frame>.

class(x)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(y)
#> [1] "tbl_ivl"    "tbl_df"     "tbl"        "data.frame"

^{Created on 2020-04-21 by the reprex package (v0.3.0)}

jayhesselberth commented 4 years ago

We could consider dropping tbl_ivl and tbl_gnm entirely, just using tbl_df.

I think the main utility of the tbl_ivl and tbl_gnm classes is to provide a handy way to ensure some basic checks have been done. But we could just run these checks at the beginning of each function instead of checking for the classes.

kriemo commented 4 years ago

That's a good idea and will likely make maintenance easier.

rnabioco / valr

fix build against dev dplyr #353