Speed up `group_by` operations in `plyranges`

mikelove commented 1 year ago

group_by followed by summarization in plyranges is not as fast as dplyr on tibble

https://github.com/sa-lee/plyranges/issues/30

However, an even more important issue is that group_by without summarization is much slower

https://gist.github.com/mikelove/d788831af3cf76de642ba03af7a0124b

Discovered in this context

https://github.com/tidyomics/tidy-ranges-tutorial/blob/main/isoform-analysis.Rmd#L86-L90

ppaxisa commented 1 year ago

Hi Mike, I was interested when we talked about that at CSAMA, and would be into trying to improve that as a starter to get into package development. Do you think it's doable for someone who does not have experience in package dev or would it be too hard?

mikelove commented 1 year ago

hi Pierre-Paul, it would be great to have your help.

I asked Michael Lawrence about this particular issue at BioC2023, and I think one solution if we just want to operate on metadata (not start, end, etc.) is to pull these columns out, operate using some other methods (dplyr, etc.) and then attach back to the GRanges. We would just need to recognize that we are doing a group_by -> mutate command and that we are only referring to metadata columns, not the protected columns.

ppaxisa commented 1 year ago

ok, so pull out metadata as a tibble and use out of the box tidyverse functions? I guess then the variables in group_by also need to be in the metadata. Should I create a blank repo with methods to test how it works or should I fork the plyranges repo, modify the methods, see it builds?

mikelove commented 1 year ago

Right. You don't need a tibble dependency, you could use data.frame. The current plyranges Imports are:

methods, dplyr, rlang (>= 0.2.0), magrittr, tidyselect (>= 1.0.0), 
rtracklayer, GenomicAlignments, GenomeInfoDb, Rsamtools, 
S4Vectors(>= 0.23.10), utils

I think blank repo is a good approach

stemangiola commented 1 year ago

@ppaxisa please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing

ppaxisa commented 1 year ago

I could use some advice here. Sorry if this is obvious, I'm just trying to understand how plyranges internals work as a starting point.

I've reused @mikelove's gist

make_rand_gr <- function(N, grps) {
  data.frame(
    seqnames = sample(c("seq1", "seq2", "seq3"), N, replace = TRUE),
    strand = sample(c("+", "-", "*"), N, replace = TRUE),
    start = rpois(N,
                  N),
    width = rpois(N, N),
    grps = sample(grps, N, replace = TRUE),
    score = runif(N)
  ) %>% as_granges()
}

set.seed(1)

r <- make_rand_gr(10000L, 1:100) 

profvis::profvis(
  for(i in 1:10) r %>% group_by(grps) %>% mutate(mn = mean(score))
)

when the object is grouped, mutate_grp first splits the object into a list with S4Vectors::split and then iterates through each group with lapply using the mutate_rng function. Now mutate_rng deals internally with metadata vs core columns with mutate_mcols and mutate_core, respectively.

So I'm wondering, if I should leave most of that structure in place and just work on mutate_mcols, but we'll still rely on the grouping structure from plyranges. Or if I should start from the beginning and implement the grouping after conversion to a data.frame.

If it's the later, it could go like this:

check if columns to mutate contain core columns, if yes rely on plyranges::mutate
then work on metadata columns by converting object to data.frame. NB: some new metadata columns might be created from core columns so probably need to convert the whole object and not just extract metadata.

Does that make sense?

PS: I'm a bit out of depth on the tidy_eval and data masking used in plyranges internals notably the overscope functions in mutate_rng:

mutate_rng <- function(.data, dots) {
  col_names <- names(dots)
  if (any(col_names %in% "")) {
    stop("mutate must have name-variable pairs as input", call. = FALSE)
  }

  overscope <- overscope_ranges(.data)
  .mutated <- overscope_eval_update(overscope, dots)
  .data <- mutate_core(.data, .mutated)
  mutate_mcols(.data, .mutated)
}

Not sure whether how important this is? But I think I can avoid touching this maybe...

mikelove commented 1 year ago

Thanks @ppaxisa this is great.

Or if I should start from the beginning and implement the grouping after conversion to a data.frame.

I think this way, as the lapply is going to bottleneck us with e.g. 10,000s of genes as the grouping variable. I guess.

some new metadata columns might be created from core columns

why so?

I am also out of my depth with the NSE in plyranges. We should set up a time for @sa-lee and/or @lawremi to walk us through how we can contribute in a way that doesn't break existing code.

ppaxisa commented 1 year ago

Sounds good, thanks for the feedback!

re: need for core columns:

some new metadata columns might be created from core columns why so?

I think some classical use cases would be:

r %>% group_by(seqnames) to get some metrics on a per chromosome basis for example.
r %>% group_by(grp) %>% summarize(mean_interval_length = mean(width)) to know get interval metrics on groups of interest.
r %>% group_by(grp, seqnames) %>% summarize(count = n()) to get number of occurrences of each metadata group on each chromosome.

if we go towards this I think maybe one useful thing would be to implement a as.data.frame method for the GroupedGenomicRanges class. So that when we convert to data.frame we don't lose the group informations. Currently the conversion wipes out the groupings.

mikelove commented 1 year ago

I thought we were going down a path where: 1) grouping hasn't involved a core column, 2) mutate hasn't made use of a core column. And then we are just handling the mutate operation, so then a subsequent summarize can be handled by original plyranges routines?

ppaxisa commented 1 year ago

well yes I guess that was the original idea but I think it does not increase the complexity too much to use core columns as inputs as long as we don't modify them. We could throw an error if the call tries to modify the core columns. This is roughly what I had in mind:

mutate_mcols <- function(gr, ...) {

  df <- as.data.frame(gr)

  # pass groupings if needed
  if(is(gr, "GroupedGenomicRanges")) {

    groups <- group_vars(gr)

    df <- group_by(df, !!!syms(groups))
  }

  dots <- enquos(..., .named = TRUE)

  all_cols <- names(dots) 
  core_cols <- c("start", "end", "width", "seqnames", "strand")

  # check new column names are not core columns
  if(any(all_cols %in% core_cols)) {
    stop("core columns are immutable")
  }

  df <- mutate(df, !!!dots) %>%
    ungroup() %>%
    dplyr::select(-all_of(core_cols)) %>%
    as("DataFrame")

  mcols(gr) <- df

  return(gr)
}

One thing I'm not sure about is how to manage groupings when we convert to data.frame. So here I use group_vars which is implemented in plyranges to get the groupings and and then re-implement the groupings after the data.frame conversion. This is because the groups attributes structure in the GroupedGenomicRanges and grouped_df are not the same, so I thought it would be simpler that way.

Regarding summarize I think this a bit of special use case because it will not return a GenomicRanges object. It currently returns a DataFrame in plyranges. So I don't really see a downside to converting to data.frame right away and use regular dplyr. I guess this also goes to how we deal with other verbs, as I here I only treat mutate.

Does that make sense? please tell me if I am going totally off-road ^^

mikelove commented 1 year ago

This makes sense to me, but I'm interested to hear Stuart's thoughts on Tuesday.

mikelove commented 1 year ago

Met with Stuart who had some useful comments, we recorded the call so I can share that with you.

Re:

Should I create a blank repo with methods to test how it works or should I fork the plyranges repo, modify the methods, see it builds?

I think now that it should be a fork, because we need to tool around in the plyranges code base to define when to call mutate_mcols(). Stuart seemed to think we can use rlang for this.

ppaxisa commented 1 year ago

Yes, thanks for the recording, I just listened to it!

Should I fork it to the tidyomics org or to my GitHub?

mikelove commented 1 year ago

Your GH is fine I think.

ppaxisa commented 1 year ago

Hi Mike,

I set up the repo here, with a branch to rewrite the mutate as a start, see mutate code.

I have a seemingly weird problem: when I run test, I get no errors:

==> Sourcing R files in 'tests' directory

Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: IRanges
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

    findMatches

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb

Attaching package: ‘plyranges’

The following object is masked from ‘package:IRanges’:

    slice

The following object is masked from ‘package:stats’:

    filter

Warning messages:
1: package ‘IRanges’ was built under R version 4.3.1 
2: package ‘GenomeInfoDb’ was built under R version 4.3.1 
[ FAIL 0 | WARN 9 | SKIP 0 | PASS 390 ]

══ Warnings ════════════════════════════════════════════════════════════════════
── Warning ('test-io-bed.R:49:3'): read_bed returns correct GRanges ────────────
Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(subcols)

  # Now:
  data %>% select(all_of(subcols))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Backtrace:
     ▆
  1. ├─testthat::expect_identical(...) at test-io-bed.R:49:2
  2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. ├─correct_gr %>% mutate(strand = "*") %>% select(subcols)
  5. ├─dplyr::select(., subcols)
  6. └─plyranges:::select.Ranges(., subcols)
  7.   └─plyranges:::select_rng(.data, .drop_ranges, ...)
  8.     ├─base::try(...)
  9.     │ └─base::tryCatch(...)
 10.     │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 11.     │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 12.     │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 13.     └─tidyselect::eval_select(rlang::expr(c(...)), var_names)
 14.       └─tidyselect:::eval_select_impl(...)
 15.         ├─tidyselect:::with_subscript_errors(...)
 16.         │ └─rlang::try_fetch(...)
 17.         │   └─base::withCallingHandlers(...)
 18.         └─tidyselect:::vars_select_eval(...)
 19.           └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
 20.             └─tidyselect:::eval_c(expr, data_mask, context_mask)
 21.               └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
 22.                 └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
 23.                   └─tidyselect:::eval_sym(expr, data_mask, context_mask)
── Warning ('test-io-bed.R:70:3'): read_bed returns correct GRanges ────────────
package ‘rtracklayer’ was built under R version 4.3.1
Backtrace:
    ▆
 1. └─base::library(BSgenome.Hsapiens.UCSC.hg19) at test-io-bed.R:70:2
 2.   └─base::.getRequiredPackages2(...)
 3.     └─base::library(...)
 4.       └─base::.getRequiredPackages2(...)
 5.         └─base::library(...)
 6.           └─base (local) testRversion(pkgInfo, package, pkgpath)
── Warning ('test-io-gff.R:114:3'): reading GFF files returns correct GRanges ──
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff3(correct_genome_hg19, test_gff3_out) at test-io-gff.R:114:2
 2.   ├─rtracklayer::export.gff3(x, file, index = index)
 3.   └─rtracklayer::export.gff3(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff3", ...)
 5.     └─BiocIO::export(object, con, "gff3", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:120:3'): reading GFF files returns correct GRanges ──
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff3(correct_gff3, test_gff3_out, index = TRUE) at test-io-gff.R:120:2
 2.   ├─rtracklayer::export.gff3(x, file, index = index)
 3.   └─rtracklayer::export.gff3(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff3", ...)
 5.     └─BiocIO::export(object, con, "gff3", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:135:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff1(correct_gff3, test_gff_out) at test-io-gff.R:135:2
 2.   ├─rtracklayer::export.gff1(x, file, index = index)
 3.   └─rtracklayer::export.gff1(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff1", ...)
 5.     └─BiocIO::export(object, con, "gff1", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:140:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff2(correct_gff3, test_gff_out) at test-io-gff.R:140:2
 2.   ├─rtracklayer::export.gff2(x, file, index = index)
 3.   └─rtracklayer::export.gff2(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff2", ...)
 5.     └─BiocIO::export(object, con, "gff2", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:145:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff3(correct_gff3, test_gff_out) at test-io-gff.R:145:2
 2.   ├─rtracklayer::export.gff3(x, file, index = index)
 3.   └─rtracklayer::export.gff3(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff3", ...)
 5.     └─BiocIO::export(object, con, "gff3", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:151:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff2(correct_gff3, test_gff2_out) at test-io-gff.R:151:2
 2.   ├─rtracklayer::export.gff2(x, file, index = index)
 3.   └─rtracklayer::export.gff2(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff2", ...)
 5.     └─BiocIO::export(object, con, "gff2", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)
── Warning ('test-io-gff.R:157:3'): writing/reading other GFF files ────────────
The phase information is missing for some CDS. The written file will
  contain some CDS with no phase information.
Backtrace:
    ▆
 1. └─plyranges::write_gff1(correct_gff3, test_gff1_out) at test-io-gff.R:157:2
 2.   ├─rtracklayer::export.gff1(x, file, index = index)
 3.   └─rtracklayer::export.gff1(x, file, index = index)
 4.     ├─BiocIO::export(object, con, "gff1", ...)
 5.     └─BiocIO::export(object, con, "gff1", ...)
 6.       ├─BiocIO::export(object, FileForFormat(con, format), ...)
 7.       └─rtracklayer::export(object, FileForFormat(con, format), ...)
 8.         └─rtracklayer (local) .local(object, con, format, ...)

[ FAIL 0 | WARN 9 | SKIP 0 | PASS 390 ]

Tests complete

but when I run the check, I get an error related to my rewriting of the mutate function when I have grouped data:

==> R CMD build plyranges

* checking for file ‘plyranges/DESCRIPTION’ ... OK
* preparing ‘plyranges’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘an-introduction.Rmd’ using rmarkdown
Error: processing vignette 'an-introduction.Rmd' failed with diagnostics:
C stack usage  7969952 is too close to the limit
--- failed re-building ‘an-introduction.Rmd’

SUMMARY: processing the following file failed:
  ‘an-introduction.Rmd’

Error: Vignette re-building failed.
Execution halted

Exited with status 1.

I don't understand why the tests pass but when running the checks it fails. I realise too that I was using the Rstudio GUI for running test and check operations but I could also do it with devtools. It seems devtools offers more flexibility for debugging?

ppaxisa commented 1 year ago

Update: I realise devtools::test was much more informative and would actually trigger the errors (not sure what's the difference with Rstudio GUI test):

══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 19.4 s

── Failed tests ──────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-colwise.R:51:1): mutating by groups
<CStackOverflowError/stackOverflowError/error/condition>
Error: C stack usage  7969472 is too close to the limit

Error (test-construction.R:63:3): Working with names
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `name = names(.data)`.
Caused by error:
! `name` must be size 3 or 1, not 0.
Backtrace:
     ▆
  1. ├─plyranges::names_to_column(ir) at test-construction.R:63:2
  2. │ ├─dplyr::mutate(.data, `:=`(!!var, names(.data))) at plyranges/R/ranges-colwise.R:43:2
  3. │ └─plyranges:::mutate.Ranges(.data, `:=`(!!var, names(.data)))
  4. │   └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
  5. │     └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
  6. │       ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
  7. │       ├─dplyr::mutate(df, !!!dots)
  8. │       └─dplyr:::mutate.data.frame(df, !!!dots)
  9. │         └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 10. │           ├─base::withCallingHandlers(...)
 11. │           └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 12. │             └─mask$eval_all_mutate(quo)
 13. │               └─dplyr (local) eval()
 14. ├─methods::as(., "DataFrame")
 15. │ └─methods:::.class1(object)
 16. ├─dplyr::select(., -all_of(core_cols))
 17. ├─dplyr::ungroup(.)
 18. ├─dplyr:::dplyr_internal_error(...)
 19. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 20. │   └─rlang:::signal_abort(cnd, .file)
 21. │     └─base::signalCondition(cnd)
 22. └─dplyr (local) `<fn>`(`<dpl:::__>`)
 23.   └─rlang::abort(message, class = error_class, parent = parent, call = error_call)

Error (test-disjoin.R:67:3): matches IRanges/GRanges tests
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `name = Rle(name)`.
Caused by error:
! `name` must be a vector, not a <Rle> object.
Backtrace:
     ▆
  1. ├─disjoin_ranges(gr_by_group) %>% mutate(name = Rle(name)) at test-disjoin.R:67:2
  2. ├─dplyr::mutate(., name = Rle(name))
  3. ├─plyranges:::mutate.Ranges(., name = Rle(name))
  4. │ └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
  5. │   └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
  6. │     ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
  7. │     ├─dplyr::mutate(df, !!!dots)
  8. │     └─dplyr:::mutate.data.frame(df, !!!dots)
  9. │       └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
 10. │         ├─base::withCallingHandlers(...)
 11. │         └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 12. │           └─mask$eval_all_mutate(quo)
 13. │             └─dplyr (local) eval()
 14. ├─methods::as(., "DataFrame")
 15. │ └─methods:::.class1(object)
 16. ├─dplyr::select(., -all_of(core_cols))
 17. ├─dplyr::ungroup(.)
 18. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
 19. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 20. │   └─rlang:::signal_abort(cnd, .file)
 21. │     └─base::signalCondition(cnd)
 22. └─dplyr (local) `<fn>`(`<dpl:::__>`)
 23.   └─rlang::abort(message, class = error_class, parent = parent, call = error_call)

Error (test-unnest.R:15:3): expanding makes sense
<dplyr:::mutate_error/rlang_error/error/condition>
Error in `mutate(df, !!!dots)`: ℹ In argument: `col1 = IntegerList(a = 1, b = c(4, 5), c = c(2, 3))`.
Caused by error:
! `col1` must be a vector, not a <CompressedIntegerList> object.
Backtrace:
     ▆
  1. ├─dplyr::mutate(...) at test-unnest.R:15:2
  2. ├─plyranges:::mutate.Ranges(...)
  3. │ └─plyranges:::mutate_rng(.data, ...) at plyranges/R/dplyr-mutate.R:140:2
  4. │   └─plyranges:::mutate_mcols(as.data.frame(.data), dots[!core_cols]) at plyranges/R/dplyr-mutate.R:43:4
  5. │     ├─... %>% as("DataFrame") at plyranges/R/dplyr-mutate.R:7:2
  6. │     ├─dplyr::mutate(df, !!!dots)
  7. │     └─dplyr:::mutate.data.frame(df, !!!dots)
  8. │       └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  9. │         ├─base::withCallingHandlers(...)
 10. │         └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
 11. │           └─mask$eval_all_mutate(quo)
 12. │             └─dplyr (local) eval()
 13. ├─methods::as(., "DataFrame")
 14. │ └─methods:::.class1(object)
 15. ├─dplyr::select(., -all_of(core_cols))
 16. ├─dplyr::ungroup(.)
 17. ├─dplyr:::dplyr_internal_error("dplyr:::mutate_not_vector", `<named list>`)
 18. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 19. │   └─rlang:::signal_abort(cnd, .file)
 20. │     └─base::signalCondition(cnd)
 21. └─dplyr (local) `<fn>`(`<dpl:::__>`)
 22.   └─rlang::abort(message, class = error_class, parent = parent, call = error_call)

[ FAIL 4 | WARN 7 | SKIP 0 | PASS 366 ]

The first failure only occur when dealing with integer ranges (not genomic ranges).
The second one is trick: related to names_to_column, it's using mutate under the hood but with an already defused expression if I understand correctly.
The Last 2 are related to using mutate with CompressedIntegerList and Rle objects.

mikelove commented 1 year ago

Awesome, I will check this out this week. I'm flying tomorrow to a conference, but can pull your repo and poke around during downtime.

ppaxisa commented 1 year ago

I have solved the 2nd error related to name_to_column and id_to_column, see last commit.

Regarding the first error of Error: C stack usage 7969472 is too close to the limit, it seems to come from the fact that GroupedIntegerRanges does not inherit from IRanges and does not have a set method for as.data.frame. It should be an easy fix if we have GroupedIntegerRanges inherit from IRanges no?

library(plyranges)

ir1 <- IRanges(start = 1:10, width = 5)

class(ir1)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"

class(ir1 %>% group_by(grp))
## [1] "GroupedIntegerRanges"
## attr(,"package")
## [1] "plyranges"

is(ir1 %>% group_by(grp), "IRanges")
## [1] FALSE

inherits(ir1 %>% group_by(grp), "IRanges")
## [1] FALSE

ir1 %>% group_by(grp) %>% as.data.frame()
## Error: C stack usage  7971520 is too close to the limit

Update: I think the simplest solution is to ungroup before converting to data.frame. This removes the error, see commit

ppaxisa commented 1 year ago

Hi Mike,

I've done some reading on the 2 last error checks that pertain to mutate with CompressedIntegerList and Rle objects. If I understand correctly those are vector-like objects from S4vectors but not supported by vanilla-tidyverse. So I'm wondering if we've reached a dead-end because using vanilla-tidyverse would be a breaking change in these situations. The only vague idea I would have would be to check the class and if S4vectors use the old plyranges::mutate syntax, otherwise use vanilla-tidyverse.

mikelove commented 1 year ago

Sorry for the silence, past two weeks I was traveling a lot.

This makes sense to me, if a user wants to do mutate on a GroupedGenomicRanges, we can offer a big speed-up when operating on non-core columns, so long as these play well with dplyr (not classes defined in S4Vectors). This seems reasonable for now, and will speed up many operations. Our motivating example was working with transcripts/exons and wanting to group by gene, and from TxDb or EnsDb, you get back character columns.

mikelove commented 1 year ago

@ppaxisa sorry for the delay again, been a busy Fall as I'm starting to teach a new module. I'm happy to come through and start testing if that would be useful?

ppaxisa commented 1 year ago

Yes! But I have not submitted a commit to drop the tests on the S4vectors objects that were generating the last 2 build errors. If you can comment those out, the package should build. I'm on vacation now so won't touch the repo til I get back in ~10 days.

ppaxisa commented 1 year ago

Hi Mike, sorry for the slow turn around. I've made some commits to solve the last test errors: basically coercing all S4Vectors list-like objects to base::list to avoir errors on some tests. Also fixed some small issues. This is what I get when running devtools::check

==> devtools::check()

══ Documenting ═════════════════════════════════════════════════════════════════════════════════════════════════════
ℹ Installed roxygen2 version (7.2.3) doesn't match required (7.1.0)
✖ `check()` will not re-document this package
══ Building ════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• CFLAGS    : -Wall -pedantic -fdiagnostics-color=always
• CXXFLAGS  : -Wall -pedantic -fdiagnostics-color=always
• CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX14FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX17FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX20FLAGS: -Wall -pedantic -fdiagnostics-color=always
── R CMD build ─────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/Users/pierre-paul.axisa/PPA_misc/plyranges/DESCRIPTION’ ...
─  preparing ‘plyranges’:
✔  checking DESCRIPTION meta-information ...
─  installing the package to build vignettes
✔  creating vignettes (29.5s)
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Removed empty directory ‘plyranges/pkgdown’
   Removed empty directory ‘plyranges/tests/testthat/_snaps’
─  building ‘plyranges_1.13.1.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

══ Checking ════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• _R_CHECK_CRAN_INCOMING_REMOTE_               : FALSE
• _R_CHECK_CRAN_INCOMING_                      : FALSE
• _R_CHECK_FORCE_SUGGESTS_                     : FALSE
• _R_CHECK_PACKAGES_USED_IGNORE_UNUSED_IMPORTS_: FALSE
• NOT_CRAN                                     : true
── R CMD check ─────────────────────────────────────────────────────────────────────────────────────────────────────
─  using log directory ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck’
─  using R version 4.3.0 (2023-04-21)
─  using platform: x86_64-apple-darwin20 (64-bit)
─  R was compiled by
       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
       GNU Fortran (GCC) 12.2.0
─  running under: macOS Monterey 12.5
─  using session charset: UTF-8
─  using options ‘--no-manual --as-cran’ (413ms)
✔  checking for file ‘plyranges/DESCRIPTION’
─  checking extension type ... Package
─  this is package ‘plyranges’ version ‘1.13.1’
─  package encoding: UTF-8
✔  checking package namespace information ...
✔  checking package dependencies (656ms)
✔  checking if this is a source package
✔  checking if there is a namespace
✔  checking for executable files (753ms)
✔  checking for hidden files and directories
✔  checking for portable file names ...
✔  checking for sufficient/correct file permissions
─  checking whether package ‘plyranges’ can be installed ... [20s/20s] WARNING (20.5s)
   Found the following significant warnings:
     Warning: package ‘IRanges’ was built under R version 4.3.1
     Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
   See ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00install.out’ for details.
✔  checking installed package size ...
✔  checking package directory ...
✔  checking for future file timestamps ...
✔  checking ‘build’ directory
✔  checking DESCRIPTION meta-information ...
✔  checking top-level files ...
✔  checking for left-over files ...
✔  checking index information ...
✔  checking package subdirectories ...
✔  checking R files for non-ASCII characters ...
✔  checking R files for syntax errors ...
✔  checking whether the package can be loaded (5.4s)
✔  checking whether the package can be loaded with stated dependencies (5.4s)
✔  checking whether the package can be unloaded cleanly (5.4s)
✔  checking whether the namespace can be loaded with stated dependencies (5.2s)
✔  checking whether the namespace can be unloaded cleanly (5.4s)
✔  checking loading without being on the library search path (5.4s)
✔  checking dependencies in R code (10.9s)
W  checking S3 generic/method consistency (5.4s)
   group_split:
     function(.tbl, ..., .keep)
   group_split.GroupedGenomicRanges:
     function(.data, ..., keep)

   group_split:
     function(.tbl, ..., .keep)
   group_split.GroupedIntegerRanges:
     function(.data, ..., keep)

   group_split:
     function(.tbl, ..., .keep)
   group_split.Ranges:
     function(.data, ..., keep)

   group_keys:
     function(.tbl, ...)
   group_keys.Ranges:
     function(.data, ...)

   group_keys:
     function(.tbl, ...)
   group_keys.GroupedGenomicRanges:
     function(.data, ...)

   group_keys:
     function(.tbl, ...)
   group_keys.GroupedIntegerRanges:
     function(.data, ...)
   See section ‘Generic functions and methods’ in the ‘Writing R
   Extensions’ manual.
✔  checking replacement functions (5.4s)
✔  checking foreign function calls (5.5s)
─  checking R code for possible problems ... [26s/26s] OK (26.3s)
✔  checking Rd files (366ms)
✔  checking Rd metadata ...
✔  checking Rd line widths ...
W  checking Rd cross-references ...
   Missing link or links in documentation object 'ranges-info.Rd':
     ‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’

   See section 'Cross-references' in the 'Writing R Extensions' manual.
✔  checking for missing documentation entries (5.4s)
✔  checking for code/documentation mismatches (16.8s)
✔  checking Rd \usage sections (5.7s)
✔  checking Rd contents ...
✔  checking for unstated dependencies in examples ...
✔  checking installed files from ‘inst/doc’ ...
✔  checking files in ‘vignettes’ ...
─  checking examples ... [28s/28s] OK (28.1s)
✔  checking for unstated dependencies in ‘tests’ ...
─  checking tests ...
    [25s/25s] OKthat.R’
   * checking for unstated dependencies in vignettes ... OK
   * checking package vignettes in ‘inst/doc’ ... OK
   * checking re-building of vignette outputs ... OK
   * checking for non-standard things in the check directory ... OK
   * checking for detritus in the temp directory ... OK
   * DONE

   Status: 3 WARNINGs
   See
     ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00check.log’
   for details.

── R CMD check results ─────────────────────────────────────────────────────────────────────── plyranges 1.13.1 ────
Duration: 3m 22.5s

❯ checking whether package ‘plyranges’ can be installed ... [20s/20s] WARNING
  See below...

❯ checking S3 generic/method consistency ... WARNING
  group_split:
    function(.tbl, ..., .keep)
  group_split.GroupedGenomicRanges:
    function(.data, ..., keep)

  group_split:
    function(.tbl, ..., .keep)
  group_split.GroupedIntegerRanges:
    function(.data, ..., keep)

  group_split:
    function(.tbl, ..., .keep)
  group_split.Ranges:
    function(.data, ..., keep)

  group_keys:
    function(.tbl, ...)
  group_keys.Ranges:
    function(.data, ...)

  group_keys:
    function(.tbl, ...)
  group_keys.GroupedGenomicRanges:
    function(.data, ...)

  group_keys:
    function(.tbl, ...)
  group_keys.GroupedIntegerRanges:
    function(.data, ...)
  See section ‘Generic functions and methods’ in the ‘Writing R
  Extensions’ manual.

❯ checking Rd cross-references ... WARNING
  Missing link or links in documentation object 'ranges-info.Rd':
    ‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’

  See section 'Cross-references' in the 'Writing R Extensions' manual.

0 errors ✔ | 3 warnings ✖ | 0 notes ✔
Error: R CMD check found WARNINGs
Execution halted

Exited with status 1.

PS: I did not manage to regenerate the documentation so I fixed some code in examples Rd files manually.

Not sure why exit status is 1 if there are only warnings. Is that normal?

mikelove commented 1 year ago

I think CHECK does exit status 1 for both warnings and errors.

So these generics and methods have different signature, did that pre-exist your additions?

Remind me, in the expand_ranges example there is an ungrouped GRanges, why did this have to change the mutate call from IntegerList to list?

ppaxisa commented 1 year ago

RE generics and methods, yes all of those existed before and I did not touch it.

RE expand_ranges, and any mutate event really in my implementation, the mutate call gets passed early to dplyr::mutate and which does not handle S4Vectors classes. That is how I wrote it anyway. I thought that even without any groupings, things like arithmetic operations would go faster in vanilla dplyr... That said, I think I can also twist things backwards and only use my implementation when dealing with a grouped object. lmk

mikelove commented 1 year ago

re: generics and methods, ok so we can PR regardless of those warnings i think.

Let's check in with @sa-lee on that because I think the last time I spoke with him we were thinking of a fork:

                        / ranges --> standard plyranges
mutate() is called --> 
                        \ grouped ranges --> our new approach leveraging dplyr

...because from what I remember, mutate on ungrouped ranges wasn't so bad, not nearly as bad as mutate on grouped ranges, which was getting passed to lapply. Stuart even optimized the former IIRC.

ppaxisa commented 1 year ago

Sounds good. Also I wanted to run the coercion back to DataFrame past you:

mutate_mcols <- function(df, dots) {

  core_cols <- intersect(colnames(df),c("start", "end", "width", "seqnames", "strand"))

  df <- mutate(df, !!!dots) %>%
    ungroup() %>%
    dplyr::select(-tidyselect::all_of(core_cols)) %>%
    as("DataFrame")
}

I'm just putting a as("DataFrame") when finished with vanilla Tidyverse operations, so I can put the meta data back into the ranges object. But I'm not sure what needs to be checked when converting between data.frame/tibble and DataFrame...

mikelove commented 1 year ago

Ok I should be able to check today

mikelove commented 11 months ago

November slipped away from me. @ppaxisa what would it take to scale back from modification of all mutate events to only modification of mutate on grouped GRanges?

ppaxisa commented 11 months ago

Hi Mike, I don't think too much work: I rewrote the 2 internal functions to be "tidyverse-dependent". But I could bring back the original ones and split between original vs new functions when dealing with non-grouped vs grouped objects. I'm a bit busy with faculty applications rn, but I'll try to push a commit over the break if I get a moment.

mikelove commented 11 months ago

Ok let’s loop back later.

ppaxisa commented 10 months ago

Hi Mike, here I am! Finally found some time to roll back some changes so we use the original plyranges::mutate functions when dealing with ungrouped ranges vs vanilla dplyr::mutate when dealing with grouped ranges. I've moved a few things here and there in the original functions but it's basically doing the same thing as the original.

Here is the devtools::check result:

==> devtools::check()

══ Documenting ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
ℹ Updating plyranges documentation
ℹ Loading plyranges
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: IRanges
Warning: package ‘IRanges’ was built under R version 4.3.1
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:utils’:

    findMatches

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
Warning: replacing previous import ‘dplyr::intersect’ by ‘GenomicRanges::intersect’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::setdiff’ by ‘GenomicRanges::setdiff’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::union’ by ‘GenomicRanges::union’ when loading ‘plyranges’
Warning: replacing previous import ‘dplyr::first’ by ‘S4Vectors::first’ when loading ‘plyranges’
Writing NAMESPACE
Writing NAMESPACE
Writing ranges-expand.Rd

══ Building ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• CFLAGS    : -Wall -pedantic -fdiagnostics-color=always
• CXXFLAGS  : -Wall -pedantic -fdiagnostics-color=always
• CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX14FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX17FLAGS: -Wall -pedantic -fdiagnostics-color=always
• CXX20FLAGS: -Wall -pedantic -fdiagnostics-color=always
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/Users/pierre-paul.axisa/PPA_misc/plyranges/DESCRIPTION’ ...
─  preparing ‘plyranges’: (403ms)
✔  checking DESCRIPTION meta-information ...
─  installing the package to build vignettes
✔  creating vignettes (30.6s)
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Removed empty directory ‘plyranges/pkgdown’
   Removed empty directory ‘plyranges/tests/testthat/_snaps’
─  building ‘plyranges_1.13.1.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

══ Checking ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Setting env vars:
• _R_CHECK_CRAN_INCOMING_REMOTE_               : FALSE
• _R_CHECK_CRAN_INCOMING_                      : FALSE
• _R_CHECK_FORCE_SUGGESTS_                     : FALSE
• _R_CHECK_PACKAGES_USED_IGNORE_UNUSED_IMPORTS_: FALSE
• NOT_CRAN                                     : true
── R CMD check ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─  using log directory ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck’
─  using R version 4.3.0 (2023-04-21)
─  using platform: x86_64-apple-darwin20 (64-bit)
─  R was compiled by
       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
       GNU Fortran (GCC) 12.2.0
─  running under: macOS Monterey 12.5
─  using session charset: UTF-8
─  using options ‘--no-manual --as-cran’ (341ms)
✔  checking for file ‘plyranges/DESCRIPTION’
─  checking extension type ... Package
─  this is package ‘plyranges’ version ‘1.13.1’
─  package encoding: UTF-8
✔  checking package namespace information ...
✔  checking package dependencies (911ms)
✔  checking if this is a source package
✔  checking if there is a namespace
✔  checking for executable files (782ms)
✔  checking for hidden files and directories
✔  checking for portable file names ...
✔  checking for sufficient/correct file permissions ...
─  checking whether package ‘plyranges’ can be installed ... [20s/21s] WARNING (20.7s)
   Found the following significant warnings:
     Warning: package ‘IRanges’ was built under R version 4.3.1
     Warning: package ‘GenomeInfoDb’ was built under R version 4.3.1
     Warning: replacing previous import ‘dplyr::intersect’ by ‘GenomicRanges::intersect’ when loading ‘plyranges’
     Warning: replacing previous import ‘dplyr::setdiff’ by ‘GenomicRanges::setdiff’ when loading ‘plyranges’
     Warning: replacing previous import ‘dplyr::union’ by ‘GenomicRanges::union’ when loading ‘plyranges’
     Warning: replacing previous import ‘dplyr::first’ by ‘S4Vectors::first’ when loading ‘plyranges’
   See ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00install.out’ for details.
✔  checking installed package size ...
✔  checking package directory ...
✔  checking for future file timestamps ...
✔  checking ‘build’ directory
✔  checking DESCRIPTION meta-information ...
✔  checking top-level files
✔  checking for left-over files ...
✔  checking index information ...
✔  checking package subdirectories ...
✔  checking R files for non-ASCII characters ...
✔  checking R files for syntax errors ...
✔  checking whether the package can be loaded (5.5s)
✔  checking whether the package can be loaded with stated dependencies (5.4s)
✔  checking whether the package can be unloaded cleanly (5.5s)
✔  checking whether the namespace can be loaded with stated dependencies (5.6s)
✔  checking whether the namespace can be unloaded cleanly (5.7s)
✔  checking loading without being on the library search path (5.5s)
✔  checking dependencies in R code (11.2s)
W  checking S3 generic/method consistency (6s)
   group_keys:
     function(.tbl, ...)
   group_keys.Ranges:
     function(.data, ...)

   group_keys:
     function(.tbl, ...)
   group_keys.GroupedGenomicRanges:
     function(.data, ...)

   group_keys:
     function(.tbl, ...)
   group_keys.GroupedIntegerRanges:
     function(.data, ...)

   group_split:
     function(.tbl, ..., .keep)
   group_split.GroupedGenomicRanges:
     function(.data, ..., keep)

   group_split:
     function(.tbl, ..., .keep)
   group_split.GroupedIntegerRanges:
     function(.data, ..., keep)

   group_split:
     function(.tbl, ..., .keep)
   group_split.Ranges:
     function(.data, ..., keep)
   See section ‘Generic functions and methods’ in the ‘Writing R
   Extensions’ manual.
✔  checking replacement functions (5.8s)
✔  checking foreign function calls (5.3s)
─  checking R code for possible problems ... [27s/27s] OK (26.9s)
✔  checking Rd files (363ms)
✔  checking Rd metadata ...
✔  checking Rd line widths ...
W  checking Rd cross-references ...
   Missing link or links in documentation object 'ranges-info.Rd':
     ‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’

   See section 'Cross-references' in the 'Writing R Extensions' manual.
✔  checking for missing documentation entries (5.4s)
✔  checking for code/documentation mismatches (16.8s)
✔  checking Rd \usage sections (5.9s)
✔  checking Rd contents ...
✔  checking for unstated dependencies in examples ...
✔  checking installed files from ‘inst/doc’ ...
✔  checking files in ‘vignettes’ ...
─  checking examples ... [27s/27s] OK (27.7s)
✔  checking for unstated dependencies in ‘tests’ ...
─  checking tests ...
    [25s/25s] OKthat.R’
   * checking for unstated dependencies in vignettes ... OK
   * checking package vignettes in ‘inst/doc’ ... OK
   * checking re-building of vignette outputs ... [10s/11s] OK
   * checking for non-standard things in the check directory ... OK
   * checking for detritus in the temp directory ... OK
   * DONE

   Status: 3 WARNINGs
   See
     ‘/Users/pierre-paul.axisa/PPA_misc/plyranges.Rcheck/00check.log’
   for details.

── R CMD check results ───────────────────────────────────────────────────────────────────────────────────────────────────────── plyranges 1.13.1 ────
Duration: 3m 26.2s

❯ checking whether package ‘plyranges’ can be installed ... [20s/21s] WARNING
  See below...

❯ checking S3 generic/method consistency ... WARNING
  group_keys:
    function(.tbl, ...)
  group_keys.Ranges:
    function(.data, ...)

  group_keys:
    function(.tbl, ...)
  group_keys.GroupedGenomicRanges:
    function(.data, ...)

  group_keys:
    function(.tbl, ...)
  group_keys.GroupedIntegerRanges:
    function(.data, ...)

  group_split:
    function(.tbl, ..., .keep)
  group_split.GroupedGenomicRanges:
    function(.data, ..., keep)

  group_split:
    function(.tbl, ..., .keep)
  group_split.GroupedIntegerRanges:
    function(.data, ..., keep)

  group_split:
    function(.tbl, ..., .keep)
  group_split.Ranges:
    function(.data, ..., keep)
  See section ‘Generic functions and methods’ in the ‘Writing R
  Extensions’ manual.

❯ checking Rd cross-references ... WARNING
  Missing link or links in documentation object 'ranges-info.Rd':
    ‘[GenomeInfoDb:fetchExtendedChromInfoFromUCSC]{GenomeInfoDb::fetchExtendedChromInfoFromUCSC()}’

  See section 'Cross-references' in the 'Writing R Extensions' manual.

0 errors ✔ | 3 warnings ✖ | 0 notes ✔
Error: R CMD check found WARNINGs
Execution halted

Exited with status 1.

Still a few old warnings (it was already there before).

I've branched this version at https://github.com/ImmuneAxisa/plyranges/tree/mutate_grp_revamp

mikelove commented 10 months ago

awesome! I will check this out right now.

mikelove commented 10 months ago

This is fantastic, I tried out the test example at the top of the issue. With the Bioc version:

   expr        min         lq       mean    median         uq        max neval
  plyra 607.711184 614.211119 619.617477 615.13423 621.733553 639.297297     5
  dplyr   3.620136   3.628664   3.848752   3.63711   4.114227   4.243623     5

And your new code:

   expr      min        lq      mean    median        uq       max neval
  plyra 28.33633 28.768101 35.222370 31.225723 40.978885 46.802812     5
  dplyr  5.40011  5.515771  5.544783  5.559354  5.623847  5.624831     5

20x speedup!

!> all.equal(x,y)
 [1] TRUE

I've got a PR ready for plyranges also, would be great to get these in before April.

ppaxisa commented 10 months ago

Great to hear! I'm again not familiar with this. What roadblocks do we need to clear to get the PR ready? Do we need to get rid of all warnings?

also, here is a little illustration of the internal function now:

Generated with

library(pkgnet)
library(tidyverse)
library(ggraph)
library(tidygraph)

report<- CreatePackageReport('plyranges')

report$FunctionReporter$pkg_graph$igraph %>% 
  as_tbl_graph() %>%
  activate(nodes) %>%
  dplyr::filter(grepl("mutate", name) & !grepl("Deferred", name)) %>%
  mutate(type = ifelse(grepl("\\.", name), "methods", "internal")) %>%
  ggraph(layout = "dh") +
  geom_edge_link(arrow = arrow(length = unit(10, 'mm')), 
                 end_cap = circle(10, 'mm')) +
  geom_node_label(aes(label = name, fill = type)) +
  scale_x_continuous(expand = expansion(0.3))

mikelove commented 7 months ago

Let’s get this in! Wanna PR?

separately, do you think I can solve the S3 generic/method consistency issue by changing .data to .tbl? I’ll work on that in the new devel branch.

ppaxisa commented 7 months ago

I can submit a PR but it says "This branch is 10 commits ahead of, 22 commits behind tidyomics/plyranges:devel.

Should I sync my repo first to the latest devel version?

I have mixed feelings about this method consistency: it's confusing to pretend that the input to the function is a tibble when it's not. But if warnings are not allowed to maintain the package on bioconductor, I guess we need to tend to it...

mikelove commented 7 months ago

I think you could sync or not. I don’t think that matters for the PR.

I’ll investigate the warnings more this week:

tidyomics / genomics-todos

Speed up `group_by` operations in `plyranges` #4