Open jtlandis opened 1 month ago
Create 2 new S3
generic functions, as_s4_proxy()
and s4_proxy_restore()
. The goal of as_s4_proxy
will be to create a new vector that works well within our functions. The proxy vectors WILL NOT inherit S4 generic functions of their S4 object counterparts. s4_proxy_restore
will be responsible for returning the proxied vector to its original type. The ultimate goal behind these functions is to provide a more efficient subsetting version of the S4 objects.
biocmask
will create generics for common S4 types we expect to encounter regularly.
as_s4_proxy <- function(x) {
UseMethod("as_s4_proxy")
}
s4_proxy_restore <- function(x) {
UseMethod("s4_proxy_restore")
}
new_s4_proxy <- function(.data, .class) {
# attr(.data, "s4_proxy") <- original
class(.data) <- c("s4_proxy", sprintf("%s_proxy", .class), class(.data))
.data
}
`[.s4_proxy` <- function(x, i, j) {
cl <- class(x)
out <- NextMethod()
class(out) <- cl
out
}
S4Vectors::Rle()
as_s4_proxy.Rle <- function(x) {
.data <- rep(x@values, times = x@lengths)
new_s4_proxy(.data, class(x))
}
s4_proxy_restore.Rle_proxy <- function(x) {
S4Vectors::Rle(values = x)
}
IRanges()
as_s4_proxy.IRanges <- function(x) {
new_s4_proxy(
tibble(start = x@start,
width = x@width,
NAMES = x@NAMES), class(x))
}
s4_proxy_restore.IRanges_proxy <- function(x) {
IRanges::IRanges(
start = x$start,
width = x$width,
names = x$NAMES
)
}
GenomicRanges::GRanges
as_s4_proxy.GRanges <- function(x) {
new_s4_proxy(
tibble(
seqnames = zap_proxy(as_s4_proxy(x@seqnames)),
ranges = zap_proxy(as_s4_proxy(x@ranges)),
strand = zap_proxy(as_s4_proxy(x@strand))
),
class(x)
)
}
s4_proxy_restore.GRanges_proxy <- function(x) {
GenomicRanges::GRanges(
seqnames = S4Vectors::Rle(x$seqnames),
ranges = IRanges::IRanges(
start = x$ranges$start,
width = x$ranges$width,
names = x$ranges$NAMES
),
strand = S4Vectors::Rle(x$strand)
)
}
as_s4_proxy.CompressedGRangesList <- function(x) {
indices <- purrr::map2(IRanges::start(x@partitioning),
IRanges::end(x@partitioning),
`:`)
proxy <- as_s4_proxy.GRanges(x@unlistData)
new_list_of(
vctrs::vec_chop(proxy, indices),
ptype = tibble::tibble(),
class = c("s4_proxy", sprintf("%s_proxy", class(x))),
partitioning = x@partitioning,
elementMetadata = x@elementMetadata
)
}
`[.CompressedGRangesList_proxy` <- function(x, i) {
p <- attr(x, "partitioning")
out <- NextMethod()
attr(out, "partitioning") <- IRanges::PartitioningByEnd(
x = cumsum(IRanges::width(p)[i]), names = names(p)[i]
)
attr(out, "elementMetadata") <- attr(out, "elementMetadata")[i,]
out
}
s4_proxy_restore.CompressedGRangesList_proxy <- function(x) {
# browser()
p <- attr(x, "partitioning")
flattened <- lapply(1:3, function(i) {
lapply(x, .subset2, i)
})
# comparable to `c()`, but uses less memory
seqnames <- vec_c(rlang::splice(flattened[[1L]]))
ranges <- dplyr::bind_rows(flattened[[2L]])
strand <- vec_c(rlang::splice(flattened[[3L]]))
gr <- GenomicRanges::GRanges(
seqnames = S4Vectors::Rle(seqnames),
ranges = IRanges::IRanges(
start = ranges$start,
width = ranges$width,
names = ranges$NAMES
),
strand = S4Vectors::Rle(strand)
)
out <- as(gr, "CompressedGRangesList")
out@partitioning <- p
out@elementMetadata <- attr(x, "elementMetadata")
out
}
An additional work around was to re-export vec_slice
as an S7
generic to support a single API in slicing any kind of vector. This allows to use the standard vctrs::vec_slice()
which is fairly efficient back ends, and allows to optimize slicing methods for other S4Vectors classes ("CompressedGRangesList" already being implemented as an example). Anything inheriting from S4Vectors::Vectors
virtual class will use [
under the hood.
The BioConductor ecosystem has made great effort to store their datatypes in a very efficient manor. These datatypes then mascarade as full length vectors, allowing you to subset up to a length
n
, when in fact the underlying data storages are vectors length< n
.S4Vectors::Rle()
is one such example. This is at odds with howdplyr
and othertidyverse
functions operate. If there isn't a method that works withvctrs::vec_chop
thenbiocmask
may fail.Why not just use
[
methods?We could provide this as a fallback, but certain
S4Vectors
are extremely slow to subset if the mascaraded length is large. For example, theCompressedGRangesList
object is highly compressed representation. Locally, I have a largeCompressedGRangesList
object that represents a list of ~250,000 elements, with over ~1.6 million Genomic Ranges in total.Because this object is so large, attempting to split this into a simple list takes an unreasonable amount of time. We can make a tibble representation in approximately 4 seconds, as opposed to over 7 minutes using the standard representation and subsetting. Unfortunately, attempting to restore to a
CompressedGRangesList
is a bit more difficult as I do not think this is a native class we can construct.