Open mikelove opened 1 year ago
Hi Mike, I was interested when we talked about that at CSAMA, and would be into trying to improve that as a starter to get into package development. Do you think it's doable for someone who does not have experience in package dev or would it be too hard?
hi Pierre-Paul, it would be great to have your help.
I asked Michael Lawrence about this particular issue at BioC2023, and I think one solution if we just want to operate on metadata (not start
, end
, etc.) is to pull these columns out, operate using some other methods (dplyr, etc.) and then attach back to the GRanges. We would just need to recognize that we are doing a group_by -> mutate command and that we are only referring to metadata columns, not the protected columns.
ok, so pull out metadata as a tibble and use out of the box tidyverse functions? I guess then the variables in group_by
also need to be in the metadata. Should I create a blank repo with methods to test how it works or should I fork the plyranges
repo, modify the methods, see it builds?
Right. You don't need a tibble dependency, you could use data.frame
. The current plyranges Imports are:
methods, dplyr, rlang (>= 0.2.0), magrittr, tidyselect (>= 1.0.0),
rtracklayer, GenomicAlignments, GenomeInfoDb, Rsamtools,
S4Vectors(>= 0.23.10), utils
I think blank repo is a good approach
@ppaxisa please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing
I could use some advice here. Sorry if this is obvious, I'm just trying to understand how plyranges
internals work as a starting point.
I've reused @mikelove's gist
make_rand_gr <- function(N, grps) {
data.frame(
seqnames = sample(c("seq1", "seq2", "seq3"), N, replace = TRUE),
strand = sample(c("+", "-", "*"), N, replace = TRUE),
start = rpois(N,
N),
width = rpois(N, N),
grps = sample(grps, N, replace = TRUE),
score = runif(N)
) %>% as_granges()
}
set.seed(1)
r <- make_rand_gr(10000L, 1:100)
profvis::profvis(
for(i in 1:10) r %>% group_by(grps) %>% mutate(mn = mean(score))
)
when the object is grouped, mutate_grp
first splits the object into a list with S4Vectors::split
and then iterates through each group with lapply
using the mutate_rng
function. Now mutate_rng
deals internally with metadata vs core columns with mutate_mcols
and mutate_core
, respectively.
So I'm wondering, if I should leave most of that structure in place and just work on mutate_mcols
, but we'll still rely on the grouping structure from plyranges
. Or if I should start from the beginning and implement the grouping after conversion to a data.frame
.
If it's the later, it could go like this:
plyranges::mutate
data.frame
. NB: some new metadata columns might be created from core columns so probably need to convert the whole object and not just extract metadata.Does that make sense?
PS: I'm a bit out of depth on the tidy_eval
and data masking used in plyranges
internals notably the overscope functions in mutate_rng
:
mutate_rng <- function(.data, dots) {
col_names <- names(dots)
if (any(col_names %in% "")) {
stop("mutate must have name-variable pairs as input", call. = FALSE)
}
overscope <- overscope_ranges(.data)
.mutated <- overscope_eval_update(overscope, dots)
.data <- mutate_core(.data, .mutated)
mutate_mcols(.data, .mutated)
}
Not sure whether how important this is? But I think I can avoid touching this maybe...
Thanks @ppaxisa this is great.
Or if I should start from the beginning and implement the grouping after conversion to a data.frame.
I think this way, as the lapply
is going to bottleneck us with e.g. 10,000s of genes as the grouping variable. I guess.
some new metadata columns might be created from core columns
why so?
I am also out of my depth with the NSE in plyranges. We should set up a time for @sa-lee and/or @lawremi to walk us through how we can contribute in a way that doesn't break existing code.
Sounds good, thanks for the feedback!
re: need for core columns:
some new metadata columns might be created from core columns why so?
I think some classical use cases would be:
r %>% group_by(seqnames)
to get some metrics on a per chromosome basis for example.r %>% group_by(grp) %>% summarize(mean_interval_length = mean(width))
to know get interval metrics on groups of interest.r %>% group_by(grp, seqnames) %>% summarize(count = n())
to get number of occurrences of each metadata group on each chromosome.if we go towards this I think maybe one useful thing would be to implement a as.data.frame
method for the GroupedGenomicRanges
class. So that when we convert to data.frame
we don't lose the group informations. Currently the conversion wipes out the groupings.
I thought we were going down a path where: 1) grouping hasn't involved a core column, 2) mutate hasn't made use of a core column. And then we are just handling the mutate operation, so then a subsequent summarize
can be handled by original plyranges routines?
well yes I guess that was the original idea but I think it does not increase the complexity too much to use core columns as inputs as long as we don't modify them. We could throw an error if the call tries to modify the core columns. This is roughly what I had in mind:
mutate_mcols <- function(gr, ...) {
df <- as.data.frame(gr)
# pass groupings if needed
if(is(gr, "GroupedGenomicRanges")) {
groups <- group_vars(gr)
df <- group_by(df, !!!syms(groups))
}
dots <- enquos(..., .named = TRUE)
all_cols <- names(dots)
core_cols <- c("start", "end", "width", "seqnames", "strand")
# check new column names are not core columns
if(any(all_cols %in% core_cols)) {
stop("core columns are immutable")
}
df <- mutate(df, !!!dots) %>%
ungroup() %>%
dplyr::select(-all_of(core_cols)) %>%
as("DataFrame")
mcols(gr) <- df
return(gr)
}
One thing I'm not sure about is how to manage groupings when we convert to data.frame. So here I use group_vars
which is implemented in plyranges
to get the groupings and and then re-implement the groupings after the data.frame conversion. This is because the groups attributes structure in the GroupedGenomicRanges
and grouped_df
are not the same, so I thought it would be simpler that way.
Regarding summarize
I think this a bit of special use case because it will not return a GenomicRanges
object. It currently returns a DataFrame
in plyranges. So I don't really see a downside to converting to data.frame
right away and use regular dplyr
. I guess this also goes to how we deal with other verbs, as I here I only treat mutate
.
Does that make sense? please tell me if I am going totally off-road ^^
This makes sense to me, but I'm interested to hear Stuart's thoughts on Tuesday.
Met with Stuart who had some useful comments, we recorded the call so I can share that with you.
Re:
Should I create a blank repo with methods to test how it works or should I fork the plyranges repo, modify the methods, see it builds?
I think now that it should be a fork, because we need to tool around in the plyranges code base to define when to call mutate_mcols()
. Stuart seemed to think we can use rlang for this.
Yes, thanks for the recording, I just listened to it!
Should I fork it to the tidyomics org or to my GitHub?
Your GH is fine I think.
Hi Mike,
I set up the repo here, with a branch to rewrite the mutate as a start, see mutate code.
I have a seemingly weird problem: when I run test, I get no errors:
==> Sourcing R files in 'tests' directory
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: IRanges
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Attaching package: ‘plyranges’
The following object is masked from ‘package:IRanges’:
slice
The following object is masked from ‘package:stats’:
filter
Warning messages:
1: package ‘IRanges’ was built under R version 4.3.1
2: package ‘GenomeInfoDb’ was built under R version 4.3.1
[ FAIL 0 | WARN 9 | SKIP 0 | PASS 390 ]
══ Warnings ════════════════════════════════════════════════════════════════════
── Warning ('test-io-bed.R:49:3'): read_bed returns correct GRanges ────────────
Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(subcols)
# Now:
data %>% select(all_of(subcols))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Backtrace:
▆
1. ├─testthat::expect_identical(...) at test-io-bed.R:49:2
2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
3. │ └─rlang::eval_bare(expr, quo_get_env(quo))
4. ├─correct_gr %>% mutate(strand = "*") %>% select(subcols)
5. ├─dplyr::select(., subcols)
6. └─plyranges:::select.Ranges(., subcols)
7. └─plyranges:::select_rng(.data, .drop_ranges, ...)
8. ├─base::try(...)
9. │ └─base::tryCatch(...)
10. │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
11. │ └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
12. │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
13. └─tidyselect::eval_select(rlang::expr(c(...)), var_names)
14. └─tidyselect:::eval_select_impl(...)
15. ├─tidyselect:::with_subscript_errors(...)
16. │ └─rlang::try_fetch(...)
17. │ └─base::withCallingHandlers(...)
18. └─tidyselect:::vars_select_eval(...)
19. └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
20. └─tidyselect:::eval_c(expr, data_mask, context_mask)
21. └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
22. └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
23. └─tidyselect:::eval_sym(expr, data_mask, context_mask)
── Warning ('test-io-bed.R:70:3'): read_bed returns correct GRanges ────────────
package ‘rtracklayer’ was built under R version 4.3.1
Backtrace:
▆
1. └─base::library(BSgenome.Hsapiens.UCSC.hg19) at test-io-bed.R:70:2
2. └─base::.getRequiredPackages2(...)
3. └─base::library(...)
4. └─base::.getRequiredPackages2(...)
5. └─base::library(...)
6. └─base (local) testRversion(pkgInfo, package, pkgpath)
── Warning ('test-io-gff.R:114:3'): reading GFF files returns correct GRanges ──
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff3(correct_genome_hg19, test_gff3_out) at test-io-gff.R:114:2
2. ├─rtracklayer::export.gff3(x, file, index = index)
3. └─rtracklayer::export.gff3(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff3", ...)
5. └─BiocIO::export(object, con, "gff3", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:120:3'): reading GFF files returns correct GRanges ──
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff3(correct_gff3, test_gff3_out, index = TRUE) at test-io-gff.R:120:2
2. ├─rtracklayer::export.gff3(x, file, index = index)
3. └─rtracklayer::export.gff3(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff3", ...)
5. └─BiocIO::export(object, con, "gff3", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:135:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff1(correct_gff3, test_gff_out) at test-io-gff.R:135:2
2. ├─rtracklayer::export.gff1(x, file, index = index)
3. └─rtracklayer::export.gff1(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff1", ...)
5. └─BiocIO::export(object, con, "gff1", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:140:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff2(correct_gff3, test_gff_out) at test-io-gff.R:140:2
2. ├─rtracklayer::export.gff2(x, file, index = index)
3. └─rtracklayer::export.gff2(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff2", ...)
5. └─BiocIO::export(object, con, "gff2", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:145:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff3(correct_gff3, test_gff_out) at test-io-gff.R:145:2
2. ├─rtracklayer::export.gff3(x, file, index = index)
3. └─rtracklayer::export.gff3(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff3", ...)
5. └─BiocIO::export(object, con, "gff3", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:151:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff2(correct_gff3, test_gff2_out) at test-io-gff.R:151:2
2. ├─rtracklayer::export.gff2(x, file, index = index)
3. └─rtracklayer::export.gff2(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff2", ...)
5. └─BiocIO::export(object, con, "gff2", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:157:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
contain some CDS with no phase information.
Backtrace:
▆
1. └─plyranges::write_gff1(correct_gff3, test_gff1_out) at test-io-gff.R:157:2
2. ├─rtracklayer::export.gff1(x, file, index = index)
3. └─rtracklayer::export.gff1(x, file, index = index)
4. ├─BiocIO::export(object, con, "gff1", ...)
5. └─BiocIO::export(object, con, "gff1", ...)
6. ├─BiocIO::export(object, FileForFormat(con, format), ...)
7. └─rtracklayer::export(object, FileForFormat(con, format), ...)
8. └─rtracklayer (local) .local(object, con, format, ...)
[ FAIL 0 | WARN 9 | SKIP 0 | PASS 390 ]
Tests complete
but when I run the check, I get an error related to my rewriting of the mutate
function when I have grouped data:
==> R CMD build plyranges
* checking for file ‘plyranges/DESCRIPTION’ ... OK
* preparing ‘plyranges’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘an-introduction.Rmd’ using rmarkdown
Error: processing vignette 'an-introduction.Rmd' failed with diagnostics:
C stack usage 7969952 is too close to the limit
--- failed re-building ‘an-introduction.Rmd’
SUMMARY: processing the following file failed:
‘an-introduction.Rmd’
Error: Vignette re-building failed.
Execution halted
Exited with status 1.
I don't understand why the tests pass but when running the checks it fails. I realise too that I was using the Rstudio GUI for running test and check operations but I could also do it with devtools. It seems devtools offers more flexibility for debugging?
Update: I realise devtools::test
was much more informative and would actually trigger the errors (not sure what's the difference with Rstudio GUI test):
══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 19.4 s
── Failed tests ──────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-colwise.R:51:1): mutating by groups
<CStackOverflowError/stackOverflowError/error/condition>
Error: C stack usage 7969472 is too close to the limit
Error (test-construction.R:63:3): Working with names
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `name = names(.data)`.
Caused by error:
! `name` must be size 3 or 1, not 0.
Backtrace:
▆
1. ├─plyranges::names_to_column(ir) at test-construction.R:63:2
2. │ ├─dplyr::mutate(.data, `:=`(!!var, names(.data))) at plyranges/R/ranges-colwise.R:43:2
3. │ └─plyranges:::mutate.Ranges(.data, `:=`(!!var, names(.data)))
4. │ └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
5. │ └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
6. │ ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
7. │ ├─dplyr::mutate(df, !!!dots)
8. │ └─dplyr:::mutate.data.frame(df, !!!dots)
9. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
10. │ ├─base::withCallingHandlers(...)
11. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
12. │ └─mask$eval_all_mutate(quo)
13. │ └─dplyr (local) eval()
14. ├─methods::as(., "DataFrame")
15. │ └─methods:::.class1(object)
16. ├─dplyr::select(., -all_of(core_cols))
17. ├─dplyr::ungroup(.)
18. ├─dplyr:::dplyr_internal_error(...)
19. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
20. │ └─rlang:::signal_abort(cnd, .file)
21. │ └─base::signalCondition(cnd)
22. └─dplyr (local) `<fn>`(`<dpl:::__>`)
23. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Error (test-disjoin.R:67:3): matches IRanges/GRanges tests
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `name = Rle(name)`.
Caused by error:
! `name` must be a vector, not a <Rle> object.
Backtrace:
▆
1. ├─disjoin_ranges(gr_by_group) %>% mutate(name = Rle(name)) at test-disjoin.R:67:2
2. ├─dplyr::mutate(., name = Rle(name))
3. ├─plyranges:::mutate.Ranges(., name = Rle(name))
4. │ └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
5. │ └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
6. │ ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
7. │ ├─dplyr::mutate(df, !!!dots)
8. │ └─dplyr:::mutate.data.frame(df, !!!dots)
9. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
10. │ ├─base::withCallingHandlers(...)
11. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
12. │ └─mask$eval_all_mutate(quo)
13. │ └─dplyr (local) eval()
14. ├─methods::as(., "DataFrame")
15. │ └─methods:::.class1(object)
16. ├─dplyr::select(., -all_of(core_cols))
17. ├─dplyr::ungroup(.)
18. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
19. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
20. │ └─rlang:::signal_abort(cnd, .file)
21. │ └─base::signalCondition(cnd)
22. └─dplyr (local) `<fn>`(`<dpl:::__>`)
23. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Error (test-unnest.R:15:3): expanding makes sense
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `col1 = IntegerList(a = 1, b = c(4, 5), c = c(2, 3))`.
Caused by error:
! `col1` must be a vector, not a <CompressedIntegerList> object.
Backtrace:
▆
1. ├─dplyr::mutate(...) at test-unnest.R:15:2
2. ├─plyranges:::mutate.Ranges(...)
3. │ └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
4. │ └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
5. │ ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
6. │ ├─dplyr::mutate(df, !!!dots)
7. │ └─dplyr:::mutate.data.frame(df, !!!dots)
8. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
9. │ ├─base::withCallingHandlers(...)
10. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
11. │ └─mask$eval_all_mutate(quo)
12. │ └─dplyr (local) eval()
13. ├─methods::as(., "DataFrame")
14. │ └─methods:::.class1(object)
15. ├─dplyr::select(., -all_of(core_cols))
16. ├─dplyr::ungroup(.)
17. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
18. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
19. │ └─rlang:::signal_abort(cnd, .file)
20. │ └─base::signalCondition(cnd)
21. └─dplyr (local) `<fn>`(`<dpl:::__>`)
22. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
[ FAIL 4 | WARN 7 | SKIP 0 | PASS 366 ]
names_to_column
, it's using mutate
under the hood but with an already defused expression if I understand correctly.CompressedIntegerList
and Rle
objects.Awesome, I will check this out this week. I'm flying tomorrow to a conference, but can pull your repo and poke around during downtime.
I have solved the 2nd error related to name_to_column
and id_to_column
, see last commit.
Regarding the first error of Error: C stack usage 7969472 is too close to the limit
, it seems to come from the fact that GroupedIntegerRanges
does not inherit from IRanges
and does not have a set method for as.data.frame
. It should be an easy fix if we have GroupedIntegerRanges
inherit from IRanges
no?
library(plyranges)
ir1 <- IRanges(start = 1:10, width = 5)
class(ir1)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
class(ir1 %>% group_by(grp))
## [1] "GroupedIntegerRanges"
## attr(,"package")
## [1] "plyranges"
is(ir1 %>% group_by(grp), "IRanges")
## [1] FALSE
inherits(ir1 %>% group_by(grp), "IRanges")
## [1] FALSE
ir1 %>% group_by(grp) %>% as.data.frame()
## Error: C stack usage 7971520 is too close to the limit
Update: I think the simplest solution is to ungroup
before converting to data.frame. This removes the error, see commit
Hi Mike,
I've done some reading on the 2 last error checks that pertain to mutate
with CompressedIntegerList
and Rle
objects. If I understand correctly those are vector-like objects from S4vectors
but not supported by vanilla-tidyverse. So I'm wondering if we've reached a dead-end because using vanilla-tidyverse would be a breaking change in these situations. The only vague idea I would have would be to check the class and if S4vectors
use the old plyranges::mutate
syntax, otherwise use vanilla-tidyverse.
Sorry for the silence, past two weeks I was traveling a lot.
This makes sense to me, if a user wants to do mutate on a GroupedGenomicRanges, we can offer a big speed-up when operating on non-core columns, so long as these play well with dplyr (not classes defined in S4Vectors). This seems reasonable for now, and will speed up many operations. Our motivating example was working with transcripts/exons and wanting to group by gene, and from TxDb or EnsDb, you get back character columns.
@ppaxisa sorry for the delay again, been a busy Fall as I'm starting to teach a new module. I'm happy to come through and start testing if that would be useful?
Yes! But I have not submitted a commit to drop the tests on the S4vectors
objects that were generating the last 2 build errors. If you can comment those out, the package should build. I'm on vacation now so won't touch the repo til I get back in ~10 days.
Hi Mike, sorry for the slow turn around. I've made some commits to solve the last test errors: basically coercing all S4Vectors
list-like objects to base::list to avoir errors on some tests. Also fixed some small issues. This is what I get when running devtools::check
==> devtools::check()
══ Documenting ═════════════════════════════════════════════════════════════════════════════════════════════════════
ℹ Installed roxygen2 version (7.2.3) doesn't match required (7.1.0)
✖ `check()` will not re-document this package
══ Building ════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• CFLAGS : -Wall -pedantic -fdiagnostics-color=always
• CXXFLAGS : -Wall -pedantic -fdiagnostics-color=always
• CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX14FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX17FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX20FLAGS: -Wall -pedantic -fdiagnostics-color=always
── R CMD build ─────────────────────────────────────────────────────────────────────────────────────────────────────
✔ checking for file ‘/Users/pierre-paul.axisa/PPA_misc/plyranges/DESCRIPTION’ ...
─ preparing ‘plyranges’:
✔ checking DESCRIPTION meta-information ...
─ installing the package to build vignettes
✔ creating vignettes (29.5s)
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Removed empty directory ‘plyranges/pkgdown’
Removed empty directory ‘plyranges/tests/testthat/_snaps’
─ building ‘plyranges_1.13.1.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'
Warning: invalid gid value replaced by that for user 'nobody'
══ Checking ════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• _R_CHECK_CRAN_INCOMING_REMOTE_ : FALSE
• _R_CHECK_CRAN_INCOMING_ : FALSE
• _R_CHECK_FORCE_SUGGESTS_ : FALSE
• _R_CHECK_PACKAGES_USED_IGNORE_UNUSED_IMPORTS_: FALSE
• NOT_CRAN : true
── R CMD check ─────────────────────────────────────────────────────────────────────────────────────────────────────
─ using log directory ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck’
─ using R version 4.3.0 (2023-04-21)
─ using platform: x86_64-apple-darwin20 (64-bit)
─ R was compiled by
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
GNU Fortran (GCC) 12.2.0
─ running under: macOS Monterey 12.5
─ using session charset: UTF-8
─ using options ‘--no-manual --as-cran’ (413ms)
✔ checking for file ‘plyranges/DESCRIPTION’
─ checking extension type ... Package
─ this is package ‘plyranges’ version ‘1.13.1’
─ package encoding: UTF-8
✔ checking package namespace information ...
✔ checking package dependencies (656ms)
✔ checking if this is a source package
✔ checking if there is a namespace
✔ checking for executable files (753ms)
✔ checking for hidden files and directories
✔ checking for portable file names ...
✔ checking for sufficient/correct file permissions
─ checking whether package ‘plyranges’ can be installed ... [20s/20s] WARNING (20.5s)
Found the following significant warnings:
Warning: package ‘IRanges’ was built under R version 4.3.1
Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
See ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00install.out’ for details.
✔ checking installed package size ...
✔ checking package directory ...
✔ checking for future file timestamps ...
✔ checking ‘build’ directory
✔ checking DESCRIPTION meta-information ...
✔ checking top-level files ...
✔ checking for left-over files ...
✔ checking index information ...
✔ checking package subdirectories ...
✔ checking R files for non-ASCII characters ...
✔ checking R files for syntax errors ...
✔ checking whether the package can be loaded (5.4s)
✔ checking whether the package can be loaded with stated dependencies (5.4s)
✔ checking whether the package can be unloaded cleanly (5.4s)
✔ checking whether the namespace can be loaded with stated dependencies (5.2s)
✔ checking whether the namespace can be unloaded cleanly (5.4s)
✔ checking loading without being on the library search path (5.4s)
✔ checking dependencies in R code (10.9s)
W checking S3 generic/method consistency (5.4s)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedGenomicRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedIntegerRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.Ranges:
function(.data, ..., keep)
group_keys:
function(.tbl, ...)
group_keys.Ranges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedGenomicRanges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedIntegerRanges:
function(.data, ...)
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
✔ checking replacement functions (5.4s)
✔ checking foreign function calls (5.5s)
─ checking R code for possible problems ... [26s/26s] OK (26.3s)
✔ checking Rd files (366ms)
✔ checking Rd metadata ...
✔ checking Rd line widths ...
W checking Rd cross-references ...
Missing link or links in documentation object 'ranges-info.Rd':
‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’
See section 'Cross-references' in the 'Writing R Extensions' manual.
✔ checking for missing documentation entries (5.4s)
✔ checking for code/documentation mismatches (16.8s)
✔ checking Rd \usage sections (5.7s)
✔ checking Rd contents ...
✔ checking for unstated dependencies in examples ...
✔ checking installed files from ‘inst/doc’ ...
✔ checking files in ‘vignettes’ ...
─ checking examples ... [28s/28s] OK (28.1s)
✔ checking for unstated dependencies in ‘tests’ ...
─ checking tests ...
[25s/25s] OKthat.R’
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking re-building of vignette outputs ... OK
* checking for non-standard things in the check directory ... OK
* checking for detritus in the temp directory ... OK
* DONE
Status: 3 WARNINGs
See
‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00check.log’
for details.
── R CMD check results ─────────────────────────────────────────────────────────────────────── plyranges 1.13.1 ────
Duration: 3m 22.5s
❯ checking whether package ‘plyranges’ can be installed ... [20s/20s] WARNING
See below...
❯ checking S3 generic/method consistency ... WARNING
group_split:
function(.tbl, ..., .keep)
group_split.GroupedGenomicRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedIntegerRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.Ranges:
function(.data, ..., keep)
group_keys:
function(.tbl, ...)
group_keys.Ranges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedGenomicRanges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedIntegerRanges:
function(.data, ...)
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
❯ checking Rd cross-references ... WARNING
Missing link or links in documentation object 'ranges-info.Rd':
‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’
See section 'Cross-references' in the 'Writing R Extensions' manual.
0 errors ✔ | 3 warnings ✖ | 0 notes ✔
Error: R CMD check found WARNINGs
Execution halted
Exited with status 1.
PS: I did not manage to regenerate the documentation so I fixed some code in examples Rd files manually.
Not sure why exit status is 1 if there are only warnings. Is that normal?
I think CHECK does exit status 1 for both warnings and errors.
So these generics and methods have different signature, did that pre-exist your additions?
Remind me, in the expand_ranges
example there is an ungrouped GRanges, why did this have to change the mutate call from IntegerList to list?
RE generics and methods, yes all of those existed before and I did not touch it.
RE expand_ranges
, and any mutate
event really in my implementation, the mutate call gets passed early to dplyr::mutate
and which does not handle S4Vectors
classes. That is how I wrote it anyway. I thought that even without any groupings, things like arithmetic operations would go faster in vanilla dplyr... That said, I think I can also twist things backwards and only use my implementation when dealing with a grouped object. lmk
re: generics and methods, ok so we can PR regardless of those warnings i think.
Let's check in with @sa-lee on that because I think the last time I spoke with him we were thinking of a fork:
/ ranges --> standard plyranges
mutate() is called -->
\ grouped ranges --> our new approach leveraging dplyr
...because from what I remember, mutate on ungrouped ranges wasn't so bad, not nearly as bad as mutate on grouped ranges, which was getting passed to lapply. Stuart even optimized the former IIRC.
Sounds good. Also I wanted to run the coercion back to DataFrame
past you:
mutate_mcols <- function(df, dots) {
core_cols <- intersect(colnames(df),c("start", "end", "width", "seqnames", "strand"))
df <- mutate(df, !!!dots) %>%
ungroup() %>%
dplyr::select(-tidyselect::all_of(core_cols)) %>%
as("DataFrame")
}
I'm just putting a as("DataFrame")
when finished with vanilla Tidyverse operations, so I can put the meta data back into the ranges object. But I'm not sure what needs to be checked when converting between data.frame/tibble
and DataFrame
...
Ok I should be able to check today
November slipped away from me. @ppaxisa what would it take to scale back from modification of all mutate
events to only modification of mutate
on grouped GRanges?
Hi Mike, I don't think too much work: I rewrote the 2 internal functions to be "tidyverse-dependent". But I could bring back the original ones and split between original vs new functions when dealing with non-grouped vs grouped objects. I'm a bit busy with faculty applications rn, but I'll try to push a commit over the break if I get a moment.
Ok let’s loop back later.
Hi Mike, here I am! Finally found some time to roll back some changes so we use the original plyranges::mutate
functions when dealing with ungrouped ranges vs vanilla dplyr::mutate
when dealing with grouped ranges. I've moved a few things here and there in the original functions but it's basically doing the same thing as the original.
Here is the devtools::check
result:
==> devtools::check()
══ Documenting ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
ℹ Updating plyranges documentation
ℹ Loading plyranges
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: IRanges
Warning: package ‘IRanges’ was built under R version 4.3.1
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
Warning: replacing previous import ‘dplyr::intersect’ by ‘GenomicRanges::intersect’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::setdiff’ by ‘GenomicRanges::setdiff’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::union’ by ‘GenomicRanges::union’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::first’ by ‘S4Vectors::first’ when loading ‘plyranges’
Writing NAMESPACE
Writing NAMESPACE
Writing ranges-expand.Rd
══ Building ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• CFLAGS : -Wall -pedantic -fdiagnostics-color=always
• CXXFLAGS : -Wall -pedantic -fdiagnostics-color=always
• CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX14FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX17FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX20FLAGS: -Wall -pedantic -fdiagnostics-color=always
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔ checking for file ‘/Users/pierre-paul.axisa/PPA_misc/plyranges/DESCRIPTION’ ...
─ preparing ‘plyranges’: (403ms)
✔ checking DESCRIPTION meta-information ...
─ installing the package to build vignettes
✔ creating vignettes (30.6s)
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Removed empty directory ‘plyranges/pkgdown’
Removed empty directory ‘plyranges/tests/testthat/_snaps’
─ building ‘plyranges_1.13.1.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'
Warning: invalid gid value replaced by that for user 'nobody'
══ Checking ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• _R_CHECK_CRAN_INCOMING_REMOTE_ : FALSE
• _R_CHECK_CRAN_INCOMING_ : FALSE
• _R_CHECK_FORCE_SUGGESTS_ : FALSE
• _R_CHECK_PACKAGES_USED_IGNORE_UNUSED_IMPORTS_: FALSE
• NOT_CRAN : true
── R CMD check ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─ using log directory ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck’
─ using R version 4.3.0 (2023-04-21)
─ using platform: x86_64-apple-darwin20 (64-bit)
─ R was compiled by
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
GNU Fortran (GCC) 12.2.0
─ running under: macOS Monterey 12.5
─ using session charset: UTF-8
─ using options ‘--no-manual --as-cran’ (341ms)
✔ checking for file ‘plyranges/DESCRIPTION’
─ checking extension type ... Package
─ this is package ‘plyranges’ version ‘1.13.1’
─ package encoding: UTF-8
✔ checking package namespace information ...
✔ checking package dependencies (911ms)
✔ checking if this is a source package
✔ checking if there is a namespace
✔ checking for executable files (782ms)
✔ checking for hidden files and directories
✔ checking for portable file names ...
✔ checking for sufficient/correct file permissions ...
─ checking whether package ‘plyranges’ can be installed ... [20s/21s] WARNING (20.7s)
Found the following significant warnings:
Warning: package ‘IRanges’ was built under R version 4.3.1
Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
Warning: replacing previous import ‘dplyr::intersect’ by ‘GenomicRanges::intersect’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::setdiff’ by ‘GenomicRanges::setdiff’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::union’ by ‘GenomicRanges::union’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::first’ by ‘S4Vectors::first’ when loading ‘plyranges’
See ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00install.out’ for details.
✔ checking installed package size ...
✔ checking package directory ...
✔ checking for future file timestamps ...
✔ checking ‘build’ directory
✔ checking DESCRIPTION meta-information ...
✔ checking top-level files
✔ checking for left-over files ...
✔ checking index information ...
✔ checking package subdirectories ...
✔ checking R files for non-ASCII characters ...
✔ checking R files for syntax errors ...
✔ checking whether the package can be loaded (5.5s)
✔ checking whether the package can be loaded with stated dependencies (5.4s)
✔ checking whether the package can be unloaded cleanly (5.5s)
✔ checking whether the namespace can be loaded with stated dependencies (5.6s)
✔ checking whether the namespace can be unloaded cleanly (5.7s)
✔ checking loading without being on the library search path (5.5s)
✔ checking dependencies in R code (11.2s)
W checking S3 generic/method consistency (6s)
group_keys:
function(.tbl, ...)
group_keys.Ranges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedGenomicRanges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedIntegerRanges:
function(.data, ...)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedGenomicRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedIntegerRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.Ranges:
function(.data, ..., keep)
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
✔ checking replacement functions (5.8s)
✔ checking foreign function calls (5.3s)
─ checking R code for possible problems ... [27s/27s] OK (26.9s)
✔ checking Rd files (363ms)
✔ checking Rd metadata ...
✔ checking Rd line widths ...
W checking Rd cross-references ...
Missing link or links in documentation object 'ranges-info.Rd':
‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’
See section 'Cross-references' in the 'Writing R Extensions' manual.
✔ checking for missing documentation entries (5.4s)
✔ checking for code/documentation mismatches (16.8s)
✔ checking Rd \usage sections (5.9s)
✔ checking Rd contents ...
✔ checking for unstated dependencies in examples ...
✔ checking installed files from ‘inst/doc’ ...
✔ checking files in ‘vignettes’ ...
─ checking examples ... [27s/27s] OK (27.7s)
✔ checking for unstated dependencies in ‘tests’ ...
─ checking tests ...
[25s/25s] OKthat.R’
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking re-building of vignette outputs ... [10s/11s] OK
* checking for non-standard things in the check directory ... OK
* checking for detritus in the temp directory ... OK
* DONE
Status: 3 WARNINGs
See
‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00check.log’
for details.
── R CMD check results ───────────────────────────────────────────────────────────────────────────────────────────────────────── plyranges 1.13.1 ────
Duration: 3m 26.2s
❯ checking whether package ‘plyranges’ can be installed ... [20s/21s] WARNING
See below...
❯ checking S3 generic/method consistency ... WARNING
group_keys:
function(.tbl, ...)
group_keys.Ranges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedGenomicRanges:
function(.data, ...)
group_keys:
function(.tbl, ...)
group_keys.GroupedIntegerRanges:
function(.data, ...)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedGenomicRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.GroupedIntegerRanges:
function(.data, ..., keep)
group_split:
function(.tbl, ..., .keep)
group_split.Ranges:
function(.data, ..., keep)
See section ‘Generic functions and methods’ in the ‘Writing R
Extensions’ manual.
❯ checking Rd cross-references ... WARNING
Missing link or links in documentation object 'ranges-info.Rd':
‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’
See section 'Cross-references' in the 'Writing R Extensions' manual.
0 errors ✔ | 3 warnings ✖ | 0 notes ✔
Error: R CMD check found WARNINGs
Execution halted
Exited with status 1.
Still a few old warnings (it was already there before).
I've branched this version at https://github.com/ImmuneAxisa/plyranges/tree/mutate_grp_revamp
awesome! I will check this out right now.
This is fantastic, I tried out the test example at the top of the issue. With the Bioc version:
expr min lq mean median uq max neval
plyra 607.711184 614.211119 619.617477 615.13423 621.733553 639.297297 5
dplyr 3.620136 3.628664 3.848752 3.63711 4.114227 4.243623 5
And your new code:
expr min lq mean median uq max neval
plyra 28.33633 28.768101 35.222370 31.225723 40.978885 46.802812 5
dplyr 5.40011 5.515771 5.544783 5.559354 5.623847 5.624831 5
20x speedup!
!> all.equal(x,y)
[1] TRUE
I've got a PR ready for plyranges also, would be great to get these in before April.
Great to hear! I'm again not familiar with this. What roadblocks do we need to clear to get the PR ready? Do we need to get rid of all warnings?
also, here is a little illustration of the internal function now:
Generated with
library(pkgnet)
library(tidyverse)
library(ggraph)
library(tidygraph)
report<- CreatePackageReport('plyranges')
report$FunctionReporter$pkg_graph$igraph %>%
as_tbl_graph() %>%
activate(nodes) %>%
dplyr::filter(grepl("mutate", name) & !grepl("Deferred", name)) %>%
mutate(type = ifelse(grepl("\\.", name), "methods", "internal")) %>%
ggraph(layout = "dh") +
geom_edge_link(arrow = arrow(length = unit(10, 'mm')),
end_cap = circle(10, 'mm')) +
geom_node_label(aes(label = name, fill = type)) +
scale_x_continuous(expand = expansion(0.3))
Let’s get this in! Wanna PR?
separately, do you think I can solve the S3 generic/method consistency issue by changing .data to .tbl? I’ll work on that in the new devel branch.
I can submit a PR but it says "This branch is 10 commits ahead of, 22 commits behind tidyomics/plyranges:devel.
Should I sync my repo first to the latest devel version?
I have mixed feelings about this method consistency: it's confusing to pretend that the input to the function is a tibble when it's not. But if warnings are not allowed to maintain the package on bioconductor, I guess we need to tend to it...
I think you could sync or not. I don’t think that matters for the PR.
I’ll investigate the warnings more this week:
group_by
followed by summarization in plyranges is not as fast as dplyr on tibblehttps://github.com/sa-lee/plyranges/issues/30
However, an even more important issue is that
group_by
without summarization is much slowerhttps://gist.github.com/mikelove/d788831af3cf76de642ba03af7a0124b
Discovered in this context
https://github.com/tidyomics/tidy-ranges-tutorial/blob/main/isoform-analysis.Rmd#L86-L90