jtlandis / biocmask

Create abstraction masks for bioc objects that only unwrap necessary objects
Other
2 stars 0 forks source link

S4 Compressed/encoded Vectors #6

Open jtlandis opened 1 month ago

jtlandis commented 1 month ago

The BioConductor ecosystem has made great effort to store their datatypes in a very efficient manor. These datatypes then mascarade as full length vectors, allowing you to subset up to a length n, when in fact the underlying data storages are vectors length < n.

S4Vectors::Rle() is one such example. This is at odds with how dplyr and other tidyverse functions operate. If there isn't a method that works with vctrs::vec_chop then biocmask may fail.

Why not just use [ methods?

We could provide this as a fallback, but certain S4Vectors are extremely slow to subset if the mascaraded length is large. For example, the CompressedGRangesList object is highly compressed representation. Locally, I have a large CompressedGRangesList object that represents a list of ~250,000 elements, with over ~1.6 million Genomic Ranges in total.

Because this object is so large, attempting to split this into a simple list takes an unreasonable amount of time. We can make a tibble representation in approximately 4 seconds, as opposed to over 7 minutes using the standard representation and subsetting. Unfortunately, attempting to restore to a CompressedGRangesList is a bit more difficult as I do not think this is a native class we can construct.

jtlandis commented 1 month ago

Proposed Solution

Create 2 new S3 generic functions, as_s4_proxy() and s4_proxy_restore(). The goal of as_s4_proxy will be to create a new vector that works well within our functions. The proxy vectors WILL NOT inherit S4 generic functions of their S4 object counterparts. s4_proxy_restore will be responsible for returning the proxied vector to its original type. The ultimate goal behind these functions is to provide a more efficient subsetting version of the S4 objects.

biocmask will create generics for common S4 types we expect to encounter regularly.

as_s4_proxy <- function(x) {
  UseMethod("as_s4_proxy")
}

s4_proxy_restore <- function(x) {
  UseMethod("s4_proxy_restore")
}

new_s4_proxy <- function(.data, .class) {
  # attr(.data, "s4_proxy") <- original
  class(.data) <- c("s4_proxy", sprintf("%s_proxy", .class), class(.data))
  .data
}

`[.s4_proxy` <- function(x, i, j) {
  cl <- class(x)
  out <- NextMethod()
  class(out) <- cl
  out
}

Example S4Vectors::Rle()


as_s4_proxy.Rle <- function(x) {
  .data <- rep(x@values, times = x@lengths)
  new_s4_proxy(.data, class(x))
}

s4_proxy_restore.Rle_proxy <- function(x) {
  S4Vectors::Rle(values = x)  
}

Example IRanges()

as_s4_proxy.IRanges <- function(x) {
  new_s4_proxy(
    tibble(start = x@start,
         width = x@width,
         NAMES = x@NAMES), class(x))
}

s4_proxy_restore.IRanges_proxy <- function(x) {
  IRanges::IRanges(
    start = x$start,
    width = x$width,
    names = x$NAMES
  )
}

Example GenomicRanges::GRanges

as_s4_proxy.GRanges <- function(x) {

  new_s4_proxy(
    tibble(
      seqnames = zap_proxy(as_s4_proxy(x@seqnames)),
      ranges = zap_proxy(as_s4_proxy(x@ranges)),
      strand = zap_proxy(as_s4_proxy(x@strand))
    ), 
    class(x)
  )

}

s4_proxy_restore.GRanges_proxy <- function(x) {
  GenomicRanges::GRanges(
    seqnames = S4Vectors::Rle(x$seqnames),
    ranges = IRanges::IRanges(
      start = x$ranges$start,
      width = x$ranges$width,
      names = x$ranges$NAMES
    ),
    strand = S4Vectors::Rle(x$strand)
  )
}

Example "CompressedGRangesList"

as_s4_proxy.CompressedGRangesList <- function(x) {

  indices <- purrr::map2(IRanges::start(x@partitioning),
                         IRanges::end(x@partitioning),
                         `:`)
  proxy <- as_s4_proxy.GRanges(x@unlistData)
  new_list_of(
    vctrs::vec_chop(proxy, indices),
    ptype = tibble::tibble(),
    class = c("s4_proxy", sprintf("%s_proxy", class(x))),
    partitioning = x@partitioning,
    elementMetadata = x@elementMetadata
  )
}

`[.CompressedGRangesList_proxy` <- function(x, i) {
  p <- attr(x, "partitioning")
  out <- NextMethod()
  attr(out, "partitioning") <- IRanges::PartitioningByEnd(
    x = cumsum(IRanges::width(p)[i]), names = names(p)[i]
  )
  attr(out, "elementMetadata") <- attr(out, "elementMetadata")[i,]
  out
}

s4_proxy_restore.CompressedGRangesList_proxy <- function(x) {
  # browser()

  p <- attr(x, "partitioning")

  flattened <- lapply(1:3, function(i) {
    lapply(x, .subset2, i)
  })

  # comparable to `c()`, but uses less memory
  seqnames <- vec_c(rlang::splice(flattened[[1L]]))
  ranges <- dplyr::bind_rows(flattened[[2L]])
  strand <- vec_c(rlang::splice(flattened[[3L]]))

  gr <- GenomicRanges::GRanges(
    seqnames = S4Vectors::Rle(seqnames),
    ranges = IRanges::IRanges(
      start = ranges$start,
      width = ranges$width,
      names = ranges$NAMES
    ),
    strand = S4Vectors::Rle(strand)
  )

  out <- as(gr, "CompressedGRangesList")
  out@partitioning  <- p
  out@elementMetadata <- attr(x, "elementMetadata")
  out

}
jtlandis commented 3 weeks ago

An additional work around was to re-export vec_slice as an S7 generic to support a single API in slicing any kind of vector. This allows to use the standard vctrs::vec_slice() which is fairly efficient back ends, and allows to optimize slicing methods for other S4Vectors classes ("CompressedGRangesList" already being implemented as an example). Anything inheriting from S4Vectors::Vectors virtual class will use [ under the hood.