Yep, that's a deal-breaker.
I'll have a closer look at how to get dplyr to S3 dispatch on an S4 class. I agree the current way isn't suitable. I was hoping that the whole 'dplyr backend' thing would work better for this case but perhaps that's not the best route.
We already talked on Twitter, but I'll just link it here as well: https://twitter.com/DarwinAwdWinner/status/1220838139264061440 https://stat.ethz.ch/R-manual/R-devel/library/methods/html/Methods_for_S3.html
Thank you both - this new version (https://github.com/jonocarroll/DFplyr/pull/4) now much more closely matches what I originally envisioned, and hopefully works significantly better. I still need to work through a few cleanup issues but your feedback would be greatly appreciated.
@jonocarroll I'm sure @lawremi has opinions on this too - I would also look into some of the work we did writing a tidy eval backend for GRanges. Would love to help out more in a couple of months once I'm done with PhD stuff!
I don't know that this is the "proper" way, but do you even need setMethod() calls? I think you can just define S3 methods that dispatch on the class attribute, since S3 dispatch doesn't do any deep inspection of the objects.
library("SingleCellExperiment")
sce <- SingleCellExperiment()
print(sce)
#> class: SingleCellExperiment
#> dim: 0 0
#> metadata(0):
#> assays(0):
#> rownames: NULL
#> rowData names(0):
#> colnames: NULL
#> colData names(0):
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):
# Defining an S3 print method takes precedence over the default show()-based display
print.SingleCellExperiment <- function(x, ...) "Surprise!"
print(sce)
#> [1] "Surprise!"
It's still preferable to register them as S3 methods so that they're not exported and to ensure method dispatch finds them (roxygen2 used to have an `@s3method foo bar` tag but I think this is defunct).
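For reference, a minimal sketch of how that registration could look today, reusing the print example above (with roxygen2 >= 7.1.0 the `@exportS3Method` tag replaces the old `@s3method` tag):

#' @exportS3Method
print.SingleCellExperiment <- function(x, ...) "Surprise!"

## which writes the registration (but no export) into NAMESPACE:
## S3method(print, SingleCellExperiment)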
Yes, this is a valid approach. Note that there is nothing special about the evaluation of an S4 method vs. an S3 method. All the same slots etc. are accessible.
Thanks all, this has certainly been a learning experience for me! Verbs are now implemented as simple S3 dispatch and everything seems to work as expected. I still need to look into the ways to add e.g. GRanges columns, so if anyone has some examples to test that would be great.
Actually, I think I have that working, too...
I think some tests for `List` columns would be a good idea too:
library(IRanges)
library(S4Vectors)
library(DFplyr)   # assuming DFplyr's mutate() method for DataFrame is in scope

a <- DataFrame(col = 1:3)
lc <- NumericList(NA, runif(3), rnorm(5))
a <- mutate(a, new_col = 2 * lc)
I get `Error in 2 * lc : non-numeric argument to binary operator`, but I'm not sure DFplyr is the problem... I get the same thing with a regular list column:
x = tibble::tibble(col = 1:3, a = list(NA, rnorm(3), rnorm(5)))
dplyr::mutate(x, newcol = 2 * a)
This sounds like it needs a map.
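For the plain list-column case, a minimal sketch of the map/lapply approach (illustrative only):

library(dplyr)
x <- tibble::tibble(col = 1:3, a = list(NA, rnorm(3), rnorm(5)))
# map over the list column element-wise instead of relying on vectorised Ops
mutate(x, newcol = lapply(a, function(v) 2 * v))
# or with purrr: mutate(x, newcol = purrr::map(a, ~ 2 * .x))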
I do, however, still have issues working with S4 columns.
That's odd. Here's what you should get when multiplying a NumericList (no data frames involved):
> lc = NumericList(NA, runif(3), rnorm(5))
> lc
NumericList of length 3
[[1]] <NA>
[[2]] 0.711012468906119 0.134303599130362 0.757121869595721
[[3]] -0.482856492744561 2.34164965257033 ... 0.633040946945741
> lc * 2
NumericList of length 3
[[1]] <NA>
[[2]] 1.42202493781224 0.268607198260725 1.51424373919144
[[3]] -0.965712985489122 4.68329930514065 ... 1.26608189389148
So `2 * lc` will work outside of `mutate()` because arithmetic (Ops/Math) group generics are defined for `List` classes, without the need for an `lapply`. I'm guessing the conversion back to a regular `list` when you call `mutate` results in the error. Should this work?
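A quick way to check that guess outside of any data frame (a sketch):

library(IRanges)
lc <- NumericList(NA, runif(3), rnorm(5))
2 * lc           # works: group generics are defined for List derivatives
2 * as.list(lc)  # plain list: "non-numeric argument to binary operator"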
btw this kind of thing was a huge pain point to get to work for plyranges and was the reason I needed to rewrite a lot of stuff without explicitly converting to tibble.
It looks like the conversion to tibble doesn't preserve the `List` structure, it drops it to a regular list-column. I might be able to recover that.
That said, it looks like dplyr outright doesn't support S4 columns at all, even if the operation doesn't involve them, and that restriction is buried in the C code:
library(S4Vectors)
library(GenomicRanges)

x <- DataFrame(a = 1:4, gr = GRanges("chrY", IRanges(1:4, width = 10)))
dplyr::filter(x, a == 3)
#> Error: Column `gr` is of unsupported class GRanges
This part might take longer to resolve.
Alternatively, I could just rewrite the verbs in base and forego the conversion to tibble entirely.
Current master branch now does a lot of hacking around but seems to enable S4 columns to at least be present in the input object, but not participate in the `...` evaluations. I'm not sure I can ever override that, it's forbidden from deep within the C code. I might be able to inspect the quosure to see if an S4 column is attempting to be involved.
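A rough sketch of what that quosure inspection could look like (the helper name and approach here are hypothetical, not DFplyr's actual code):

library(rlang)

# Does any column referenced by a quosure hold an S4 object?
touches_s4_column <- function(data, quo) {
  vars <- intersect(all.vars(quo_get_expr(quo)), colnames(data))
  any(vapply(vars, function(v) isS4(data[[v]]), logical(1)))
}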
Hasn't someone implemented dplyr on top of the base API?
There's bplyr which is just base + rlang. This was my thinking above but I'd have to see if it works with the S4/list-column stuff better than what I now have.
Going strictly base would be just calling e.g. `[` for `filter` and would forgo all the NSE parts which make `dplyr` nice to use.
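For what it's worth, rlang makes it possible to keep the NSE interface while evaluating against the DataFrame's columns directly; a minimal, hypothetical sketch (not the bplyr or DFplyr implementation):

library(rlang)
library(S4Vectors)

# Evaluate filter expressions tidily against the columns, then subset with `[`
filter_DF <- function(.data, ...) {
  keep <- Reduce(`&`, lapply(enquos(...), eval_tidy, data = as.list(.data)))
  .data[keep, , drop = FALSE]
}

d <- DataFrame(a = 1:4, b = letters[1:4])
filter_DF(d, a > 2, b != "d")   # keeps the row where a == 3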
Okay, it works pretty well (looking at the internals of `b_mutate.default`, at least). It seems to support S4 columns in `...` AND `mutate(a, new_col = 2*lc)`... I may have to borrow some of this code @yonicd
@DarwinAwardWinner @sa-lee (and anyone interested) check out this branch: https://github.com/jonocarroll/DFplyr/tree/bplyr_integration
The base rewrite seems to largely work (after finding a lot of edge-cases), including ...
suppressWarnings(suppressMessages(library(DFplyr)))
a = S4Vectors::DataFrame(row = 1:3)
lc = IRanges::NumericList(NA, runif(3), rnorm(5))
lc
#> NumericList of length 3
#> [[1]] <NA>
#> [[2]] 0.418686023913324 0.0529155859258026 0.631206635152921
#> [[3]] -1.13643749728751 -1.51225224378803 ... -0.862702277841997
mutate(a, new_col = 2*lc)
#> dplyr-compatible DataFrame with 3 rows and 2 columns
#> row new_col
#> <integer> <NumericList>
#> 1 1 NA
#> 2 2 0.837372047826648,0.105831171851605,1.26241327030584
#> 3 3 -2.27287499457503,-3.02450448757607,-1.85582212151293,...
Created on 2020-02-03 by the reprex package (v0.3.0)
:tada:
The README on that branch has a lot of examples of what works - it's my current test.
Looks good... but hey, it looks like you are replacing the `show()` method for DataFrame objects defined in S4Vectors with your own. Why would you do that?
I wanted to add the 'dplyr-compatible' line in the header, and added the 'Groups:' list when the object has grouping, similar to the group display in tibbles. I could presumably add a new `DataFrame_grouped` class which inherits and dispatches its own `show()` (?).
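A sketch of what that subclass idea might look like (class and slot names are hypothetical; assumes a recent S4Vectors where DFrame is the concrete class):

library(S4Vectors)

# A grouped subclass with its own display, leaving the parent show() method untouched
setClass("DataFrame_grouped", contains = "DFrame",
         slots = c(groups = "list"))

setMethod("show", "DataFrame_grouped", function(object) {
  callNextMethod()   # the normal DataFrame display
  cat("Groups:", paste(names(object@groups), collapse = ", "), "\n")
})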
We should decouple persistent grouping from any particular API implementation (like dplyr). Grouping should probably go into a more general package.
Not sure why you need that but the general rule is that you should never replace an existing method with your own. If you really need to do this, you first need to extend the class so that you can override instead of replacing. However, extending DFrame means that `mutate()` would no longer act as an endomorphism, i.e. it would take a DFrame and return a DataFrame_grouped. That would be kind of confusing, e.g. why would some DataFrame be displayed as being dplyr-compatible and others not, when the truth is that all of them are?
The original reason was that 'real' `DataFrame`s weren't compatible - I tried to use the `dplyr` backend route and it created a list with a `DataFrame` sub-element, so I had to explicitly add support. That's no longer the case, so I don't really need the 'dplyr-compatible' part any more.
For the grouping information... I'm interested to hear ideas for how that can be displayed.
At the moment, following dplyr's implementation, groups are literally just a table-like structure added as an attribute (in this case, on `listData`) listing the groupings and the data rows relevant to each combination, plus a set of functions to act on those.
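For comparison, this is how dplyr itself stores grouping on a grouped tibble (a sketch; the attribute layout is dplyr's, the example data is arbitrary):

library(dplyr)

g <- group_by(mtcars, cyl)
attr(g, "groups")   # a tibble of the grouping keys plus a `.rows` list-column of row indices
group_data(g)       # the documented accessor for the same structure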
I know @sa-lee had to do this for GRanges. Ideally it would be more formal than metadata. We'd want to avoid having to reimplement it for every implementation of DataFrame though.
Yeah, this all started with a chat with Stuart. The big trip-up with this current implementation is that it essentially `split`s the data according to the grouping, whereas GRanges can (IIRC) efficiently perform a calculation in a groupwise manner if that info can be passed on.
> We'd want to avoid having to reimplement it for every implementation of DataFrame though.
The only way to achieve this is by using composition instead of inheritance, with the following implications:

- DataFrame objects already have a `metadata` slot (DataFrame derives from Vector), so that could be used. However, if we want something more formal then that means we need to add a `groups` slot at the level of the DataFrame class (which will soon become VIRTUAL).
- Another approach would be to explore how SplitDataFrameList could fit here. It's the S4Vectors way of doing what tibble/dplyr does with grouped_df objects:
> a <- S4Vectors::DataFrame(tx_id=11:15, score=runif(5), gene_id=c(2L, 1L, 5L, 2L, 2L))
> split(a, a$gene_id) # group by gene_id
SplitDataFrameList of length 3
$`1`
dplyr-compatible DataFrame with 1 row and 3 columns
tx_id score gene_id
<integer> <numeric> <integer>
1 12 0.236817 1
$`2`
dplyr-compatible DataFrame with 3 rows and 3 columns
tx_id score gene_id
<integer> <numeric> <integer>
1 11 0.905781 2
2 14 0.391706 2
3 15 0.732056 2
$`5`
dplyr-compatible DataFrame with 1 row and 3 columns
tx_id score gene_id
<integer> <numeric> <integer>
1 13 0.940257 5
However I suspect that some differences in the exact semantics of what "grouping" means/does in tibble/dplyr vs S4Vectors might get in the way. For example, the former allows grouping by more than one variable, keeps track of them, and "remembers" the original order of the rows, whereas with a SplitDataFrameList object you cannot go back to the original DataFrame at the moment (unless you know what grouping variable was used, in which case you can `unsplit()`).
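Continuing the example above, a sketch of the `unsplit()` round trip (assuming the grouping variable is known):

sdf <- split(a, a$gene_id)       # group by gene_id
b <- unsplit(sdf, a$gene_id)     # back to a single DataFrame, in the original row order
identical(b$tx_id, a$tx_id)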
If nothing else, I may be able to use this in the current group-aware verbs - I think I currently do the split manually. The advantage of that was preserving rownames through the split operation for e.g. group_by %>% filter.
First of all, thanks for trying to make the power and flexibility of DataFrame objects available to the dplyr fans!
I'd like to point out the following problem with the current approach though. I hope you don't mind.
Problem: "dplyr-compatible DataFrames" are incompatible with anything that expects a DataFrame!
If the idea is to support the dplyr syntax on DataFrame objects, e.g. `mutate()`, why do you need to introduce src_DF objects? Why not just define a `mutate()` method that works on DataFrame objects instead? Then the user could do `mutate(DF, newvar = cyl + hp)` directly on their DataFrame object (i.e. without having to call `add_dplyr_compat()` on them first) and this would return a DataFrame.

A typical use case is to add new metadata columns to a Vector derivative `x` with something like this:
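(A hypothetical sketch of that kind of usage, assuming `mutate()` works directly on DataFrame and `mcols()` holds the metadata columns; the column names are made up:)

## one step, directly on the metadata columns of x
mcols(x) <- mutate(mcols(x), newvar = 2 * score)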
Right now, in order to do the above they would need to do something like:
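(Again a sketch, using the `add_dplyr_compat()` and `as(d, "DataFrame")` steps discussed in this thread:)

## round trip through a src_DF object
d <- add_dplyr_compat(mcols(x))
d <- mutate(d, newvar = 2 * score)
mcols(x) <- as(d, "DataFrame")   # this coercion currently errors (see note below)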
(Note that `as(d, "DataFrame")` gives an error at the moment, but that should be fixable.)

So the convenience of the dplyr syntax is defeated by all the extra work they need to do to switch back and forth between DataFrame and src_DF objects.
Another thing I noticed is that src_DF objects don't seem to do well with S4 objects as columns (note that the ability to store all sorts of vector-like objects in their columns is a great strength of DataFrame objects and what motivated their introduction):
The `gr` column is missing!

If I try to add the column early, I end up with a broken src_DF object:
Best, H.