Most of the errors are related to testthat::expect_equal() no longer considering a trbl_interval() object equal to a tibble::tribble(), due to differing class attributes. Substituting testthat::expect_equivalent() gets around this issue. The alternative fix is mentioned in #274, but that would be a big breaking change that I am not in favor of at this point.
library(testthat)
library(valr)
x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 1000,   2000,
  "chr1", 1000,   400
)
pred <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 1000,   400,
  "chr1", 1000,   2000
)
res <- bed_sort(x)
expect_equal(res, pred)
#> Error: `res` not equal to `pred`.
#> Attributes: < Length mismatch: comparison on first 2 components >
#> Attributes: < Component "class": Lengths (4, 3) differ (string compare on first 3) >
#> Attributes: < Component "class": 3 string mismatches >
class(res)
#> [1] "tbl_ivl" "tbl_df" "tbl" "data.frame"
class(pred)
#> [1] "tbl_df" "tbl" "data.frame"
pred == res
#> chrom start end
#> [1,] TRUE TRUE TRUE
#> [2,] TRUE TRUE TRUE
expect_equivalent(res, pred)
Created on 2020-03-20 by the reprex package (v0.3.0)
dplyr::arrange() has a performance regression (https://github.com/tidyverse/dplyr/issues/4962) in the dev version of dplyr, which will slow down many of valr's functions. The regression will likely be fixed before dplyr v1.0.0 is released. However, testing on my end suggests that using base R order() with the radix sorting method is faster than CRAN dplyr (or dev dplyr) by ~25-40%. I suggest we modify bed_sort() to use base R unless arrange() is fundamentally rewritten in v1.0.0.
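As a minimal sketch of that change (the helper name sort_by_coords() is hypothetical, not valr's actual internal API), the arrange() call inside bed_sort() could be replaced with something like:

# Sketch only: radix-based base R replacement for
# arrange(x, chrom, start, end) inside bed_sort().
sort_by_coords <- function(x) {
  ord <- order(x$chrom, x$start, x$end, method = "radix")
  x[ord, ]
}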
Dev dplyr performance:
library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
packageVersion("dplyr")
#> [1] '0.8.99.9002'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                    expr       min        lq      mean    median       uq      max neval cld
#>                           arrange(x, chrom, start, end) 6.6252349 6.6252349 6.6921549 6.6921549 6.759075 6.759075     2   b
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2678227 0.2678227 0.3317684 0.3317684 0.395714 0.395714     2   a
Created on 2020-03-21 by the reprex package (v0.3.0)
dplyr v0.8.5 (CRAN) performance:
library(valr)
library(dplyr, warn.conflicts = FALSE)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
# number of intervals
n <- 1e7
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
packageVersion("dplyr")
#> [1] '0.8.5'
microbenchmark::microbenchmark(
  arrange(x, chrom, start, end),
  x[order(x$chrom, x$start, x$end, method = "radix"), ],
  times = 2,
  unit = "s"
)
#> Unit: seconds
#>                                                    expr       min        lq      mean    median        uq       max neval cld
#>                           arrange(x, chrom, start, end) 0.4096381 0.4096381 0.4190501 0.4190501 0.4284621 0.4284621     2   b
#>  x[order(x$chrom, x$start, x$end, method = "radix"), ] 0.2582281 0.2582281 0.2600742 0.2600742 0.2619203 0.2619203     2   a
Created on 2020-03-21 by the reprex package (v0.3.0)
There is also a performance hit with summarize() that will not be addressed in v1.0.0. This is really noticeable when using n() within summary functions (e.g. bed_map() or bed_merge()). We can substitute length() to regain some of the performance losses. The benchmarks vignette should be updated to use bed_map(x, y, .n = length(end)) instead of calling n().
https://github.com/tidyverse/dplyr/issues/5017
library(valr)
library(bench)
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
# number of intervals
n <- 1e6
seed_x <- 1010486
x <- bed_random(genome, n = n, seed = seed_x)
seed_y <- 1010487
y <- bed_random(genome, n = n, seed = seed_y)
mark(bed_map(x, y, .n = n()),
     bed_map(x, y, .n = length(end)),
     iterations = 2)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 bed_map(x, y, .n = n()) 6.6s 7.06s 0.142 449MB 6.30
#> 2 bed_map(x, y, .n = length(end)) 1.79s 1.85s 0.539 461MB 4.85
Created on 2020-03-22 by the reprex package (v0.3.0)
New upstream changes are causing multiple test errors due to our custom tbl_ivl and tbl_gnm classes not playing well with bind_rows(). The changes don't seem to be stable yet, so I will hold off on trying to fix this.
library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)
x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200
)
y <- as.tbl_interval(x)
bind_rows(x, y)
#> Error: No common type for `..1` <tbl_df<
#> chrom: character
#> start: double
#> end : double
#> >> and `..2` <tbl_ivl<
#> chrom: character
#> start: double
#> end : double
#> >>.
class(x)
#> [1] "tbl_df" "tbl" "data.frame"
class(y)
#> [1] "tbl_ivl" "tbl_df" "tbl" "data.frame"
Created on 2020-04-02 by the reprex package (v0.3.0)
Output from the newest vctrs: we will need to define some custom functions for vctrs (see https://github.com/r-lib/vctrs/issues/982) in order to keep our tbl_ivl and tbl_gnm tibble subclasses compatible with dplyr. A sketch of what those coercion methods might look like follows the reprex below.
library(tibble)
library(dplyr, warn.conflicts = F)
library(valr)
x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200
)
y <- as.tbl_interval(x)
bind_rows(x, y)
#> Warning: Can't combine <tbl_df> and <tbl_ivl>.
#> ℹ Convert all inputs to the same class to avoid this warning.
#> ℹ See <https://vctrs.r-lib.org/reference/faq-warning-convert-inputs.html>.
#> ℹ Falling back to <data.frame>.
#> Error: Can't convert <tbl_ivl> to <data.frame>.
class(x)
#> [1] "tbl_df" "tbl" "data.frame"
class(y)
#> [1] "tbl_ivl" "tbl_df" "tbl" "data.frame"
Created on 2020-04-21 by the reprex package (v0.3.0)
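A minimal sketch of those coercion methods, based on the pattern in the vctrs FAQ for data frame subclasses (vctrs >= 0.3.0). new_tbl_ivl() is an illustrative constructor, not valr's actual internals, and in the package these would be registered as S3 methods:

# Sketch only: vec_ptype2()/vec_cast() methods so that combining a tbl_ivl
# with a plain tbl_df (e.g. in bind_rows()) returns a tbl_ivl instead of
# erroring. new_tbl_ivl() is a hypothetical constructor for illustration.
library(vctrs)

new_tbl_ivl <- function(x) {
  tibble::new_tibble(x, nrow = nrow(x), class = "tbl_ivl")
}

# common prototype of the two inputs is the richer class, tbl_ivl
vec_ptype2.tbl_ivl.tbl_ivl <- function(x, y, ...) new_tbl_ivl(tib_ptype2(x, y, ...))
vec_ptype2.tbl_ivl.tbl_df  <- function(x, y, ...) new_tbl_ivl(tib_ptype2(x, y, ...))
vec_ptype2.tbl_df.tbl_ivl  <- function(x, y, ...) new_tbl_ivl(tib_ptype2(x, y, ...))

# casts between tbl_ivl and tbl_df in both directions
vec_cast.tbl_ivl.tbl_ivl <- function(x, to, ...) new_tbl_ivl(tib_cast(x, to, ...))
vec_cast.tbl_ivl.tbl_df  <- function(x, to, ...) new_tbl_ivl(tib_cast(x, to, ...))
vec_cast.tbl_df.tbl_ivl  <- function(x, to, ...) tib_cast(x, to, ...)

With methods along these lines, bind_rows(x, y) should return a tbl_ivl rather than erroring, though the details depend on the vctrs version.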
We could consider dropping tbl_ivl and tbl_gnm entirely and just using tbl_df.
I think the main utility of the tbl_ivl and tbl_gnm classes is to provide a handy way to ensure some basic checks have been done. But we could just run these checks at the beginning of each function instead of checking for the classes (see the sketch below).
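A minimal sketch of that idea; check_required_cols() and bed_example() are illustrative names, not valr's API:

# Sketch only: validate the required columns at the top of each function
# instead of relying on a tbl_ivl class attribute. Names are hypothetical.
check_required_cols <- function(x) {
  req <- c("chrom", "start", "end")
  missing <- setdiff(req, colnames(x))
  if (length(missing) > 0) {
    stop("expected columns missing: ", paste(missing, collapse = ", "),
         call. = FALSE)
  }
  invisible(x)
}

bed_example <- function(x) {
  check_required_cols(x)  # basic checks run here, no class check needed
  # ... interval operations on a plain tbl_df ...
  x
}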
That's a good idea and will likely make maintenance easier.