jonocarroll / DFplyr

A `DataFrame` (`S4Vectors`) backend for `dplyr`
GNU General Public License v3.0
21 stars 0 forks source link

Why wrap DataFrame objects in src_DF objects (a.k.a. "dplyr-compatible DataFrames")? #3

Closed hpages closed 4 years ago

hpages commented 4 years ago

First of all, thanks for trying to make the power and flexibility of DataFrame objects available to the dplyr fans!

I'd like to point the following problem with the current approach though. I hope you don't mind.

Problem: "dplyr-compatible DataFrames" are incompatible with anything that expects a DataFrame!

If the idea is to support the dplyr syntax on DataFrame objects, e.g. mutate(), why do you need to introduce src_DF objects? Why not just define a mutate() method that works on DataFrame objects instead? Then the user could do mutate(DF, newvar = cyl + hp) directly on their DataFrame object (i.e. without having to call add_dplyr_compat() on them first) and this would return a DataFrame.

A typical use case is to add a new metadata columns to a Vector derivative x with something like this:

mcols(x) <- mutate(mcols(x), newvar = cyl + hp)

Right now, in order to do the above they would need to do something like:

d <- add_dplyr_compat(mcols(x))
d <- mutate(d, newvar = cyl + hp)
mcols(x) <- as(d, "DataFrame")

(Note that as(d, "DataFrame") gives an error at the moment, but that should be fixable.)

So the convenience of the dplyr syntax is defeated by all the extra work they need to do to switch back and forth between DataFrame and src_DF objects.

Another thing I noticed is that src_DF objects don't seem to do well with S4 objects as columns (note that the ability to store all sorts of vector-like objects in their columns is a great strength of DataFrame objects and what motivated their introduction):

suppressPackageStartupMessages({
  library(S4Vectors)
  library(GenomicRanges)
  library(dplyr)
  library(DFplyr)
})
m <- mtcars[, c("cyl", "hp", "am")]
DF <- as(m, "DataFrame")
d <- add_dplyr_compat(DF)
d$gr <- GRanges("chrY", IRanges(1:32, width=10))
d
# dplyr-compatible DataFrame with 32 rows and 3 columns
#                         cyl        hp        am
#                   <numeric> <numeric> <numeric>
# Mazda RX4                 6       110         1
# Mazda RX4 Wag             6       110         1
# Datsun 710                4        93         1
# Hornet 4 Drive            6       110         0
# Hornet Sportabout         8       175         0
# ...                     ...       ...       ...
# Lotus Europa              4       113         1
# Ford Pantera L            8       264         1
# Ferrari Dino              6       175         1
# Maserati Bora             8       335         1
# Volvo 142E                4       109         1

The gr column is missing!

If I try to add the column early, I end up with a broken src_DF object:

DF$gr <- GRanges("chrY", IRanges(1:32, width=10))
d <- add_dplyr_compat(DF)
d
# dplyr-compatible DataFrame with 32 rows and 4 columns
# Error in dimnames(x) <- dn : 
#   length of 'dimnames' [1] not equal to array extent
# In addition: Warning messages:
# 1: In cbind2(argl[[i]], r) :
#   number of rows of result is not a multiple of vector length (arg 1)
# 2: In cbind2(argl[[i]], r) :
#   number of rows of result is not a multiple of vector length (arg 1)
# 3: In cbind2(argl[[i]], r) :
#   number of rows of result is not a multiple of vector length (arg 1)

Best, H.

jonocarroll commented 4 years ago

Yep, that's a deal-breaker.

I'll have a closer look at how to get dplyr to S3 dispatch on an S4 class. I agree the current way isn't suitable. I was hoping that the whole 'dplyr backend' thing would work better for this case but perhaps that's not the best route.

DarwinAwardWinner commented 4 years ago

We already talked on Twitter, but I'll just link it here as well: https://twitter.com/DarwinAwdWinner/status/1220838139264061440 https://stat.ethz.ch/R-manual/R-devel/library/methods/html/Methods_for_S3.html

jonocarroll commented 4 years ago

Thank you both - this new version (https://github.com/jonocarroll/DFplyr/pull/4) now much more closely matches what I originally envisioned, and hopefully works significantly better. I still need to work through a few cleanup issues but your feedback would be greatly appreciated.

sa-lee commented 4 years ago

@jonocarroll i'm sure @lawremi has opinions on this too - i would also look into some of the work we did writing a tidy eval backend for GRanges. would love to help out more in a couple of months once i'm done with phd stuff!

alanocallaghan commented 4 years ago

I don't know that this is the "proper" way, but do you even need setMethod calls? I think you can just set s3 methods to dispatch on the class attribute (eg s3), since they don't really do any deep inspection of the objects.

library("SingleCellExperiment")
sce <- SingleCellExperiment()
print(sce)
#> class: SingleCellExperiment 
#> dim: 0 0 
#> metadata(0):
#> assays(0):
#> rownames: NULL
#> rowData names(0):
#> colnames: NULL
#> colData names(0):
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):
print.SingleCellExperiment <- function(x) "Surprise!"
print(sce)
#> [1] "Surprise!"

It's still preferable to mark them as s3 methods so that they're not exported and to ensure method dispatch finds them (roxygen2 used to have an @s3method foo bar tag but I think this is defunct).

lawremi commented 4 years ago

Yes, this is a valid approach. Note that there is nothing special about the evaluation of an S4 method vs. an S3 method. All the same slots etc. are accessible.

jonocarroll commented 4 years ago

Thanks all, this has certainly been a learning experience for me! Verbs are now implemented as simple S3 dispatch and everything seems to work as expected. I still need to look into the ways to add e.g. GRanges columns, so if anyone has some examples to test that would be great.

jonocarroll commented 4 years ago

Actually, I think I have that working, too...

sa-lee commented 4 years ago

I think some tests for List columns would be a good idea too:

library(IRanges)
library(S4Vectors)
a = DataFrame(col = 1:3)
lc = NumericList(NA, runif(3), rnorm(5))
a <- mutate(a, new_col = 2*lc)
jonocarroll commented 4 years ago

I get

Error in 2 * lc : non-numeric argument to binary operator

but I'm not sure DFplyr is the problem... I get the same thing with a regular list column

x = tibble::tibble(col = 1:3, a = list(NA, rnorm(3), rnorm(5)))
dplyr::mutate(x, newcol = 2 * a)

This sounds like it needs a map.

I do, however, still have issues working with S4 columns.

DarwinAwardWinner commented 4 years ago

That's odd. Here's what you should get when multiplying a NumericList (no data frames involved):

> lc = NumericList(NA, runif(3), rnorm(5))
> lc
NumericList of length 3
[[1]] <NA>
[[2]] 0.711012468906119 0.134303599130362 0.757121869595721
[[3]] -0.482856492744561 2.34164965257033 ... 0.633040946945741
> lc * 2
NumericList of length 3
[[1]] <NA>
[[2]] 1.42202493781224 0.268607198260725 1.51424373919144
[[3]] -0.965712985489122 4.68329930514065 ... 1.26608189389148
sa-lee commented 4 years ago

So 2 * lc will work outside of mutate() cause there's a Math ops defined for List classes without the need for for an lapply. I'm guessing the conversion back to a regular list when you call mutate results in the error. Should this work?

sa-lee commented 4 years ago

btw this kind of thing was a huge pain point to get to work for plyranges and was the reason I needed to rewrite a lot of stuff without explicitly converting to tibble.

jonocarroll commented 4 years ago

It looks like the conversion to tibble doesn't perserve the List structure, it drops it to a regular list-column. I might be able to recover that.

That said, it looks like dplyr straight out doesn't support S4 columns at all, even if the operation doesn't involve them, and that's buried in the C code

x <- DataFrame(a = 1:4, gr = GRanges("chrY", IRanges(1:4, width = 10)))
dplyr::filter(x, a == 3)
Error: Column `gr` is of unsupported class GRanges

This part might take longer to resolve.

Alternatively, I could just rewrite the verbs in base and forego the conversion to tibble entirely.

jonocarroll commented 4 years ago

Current master branch now does a lot of hacking around but seems to enable S4 columns to be at least present in the input object, but not participate in the ... evaluations. I'm not sure I can ever override that, it's forbidden from deep within the C code. I might be able to inspect the quosure to see if a S4 column is attempting to be involved.

lawremi commented 4 years ago

Hasn't someone implemented dplyr on top of the base API?

jonocarroll commented 4 years ago

There's bplyr which is just base + rlang. This was my thinking above but I'd have to see if it works with the S4/list-column stuff better than what I now have.

jonocarroll commented 4 years ago

Going strictly base would be just calling e.g. [ for filter and would forego all the NSE parts which make dplyr nice to use.

jonocarroll commented 4 years ago

Okay, it works pretty well (looking at the internals of b_mutate.default, at least). It seems to support S4 columns in ... AND mutate(a, new_col = 2*lc)... I may have to borrow some of this code @yonicd

jonocarroll commented 4 years ago

@DarwinAwardWinner @sa-lee (and anyone interested) check out this branch: https://github.com/jonocarroll/DFplyr/tree/bplyr_integration

The base rewrite seems to largely work (after finding a lot of edge-cases), including ...

suppressWarnings(suppressMessages(library(DFplyr)))
a = S4Vectors::DataFrame(row = 1:3)
lc = IRanges::NumericList(NA, runif(3), rnorm(5))
lc
#> NumericList of length 3
#> [[1]] <NA>
#> [[2]] 0.418686023913324 0.0529155859258026 0.631206635152921
#> [[3]] -1.13643749728751 -1.51225224378803 ... -0.862702277841997
mutate(a, new_col = 2*lc)
#> dplyr-compatible DataFrame with 3 rows and 2 columns
#>         row                                                   new_col
#>   <integer>                                             <NumericList>
#> 1         1                                                        NA
#> 2         2      0.837372047826648,0.105831171851605,1.26241327030584
#> 3         3 -2.27287499457503,-3.02450448757607,-1.85582212151293,...

Created on 2020-02-03 by the reprex package (v0.3.0)

:tada:

The README on that branch has a lot of examples of what works - it's my current test.

hpages commented 4 years ago

Looks good... but hey, it looks like you are replacing the show() method for DataFrame objects defined in S4Vectors with your own. Why would you do that?

jonocarroll commented 4 years ago

I wanted to add the 'dplyr-compatible' line in the header, and added the 'Groups:' list when the object has grouping, similar to the group display in tibbles. I could presumably add a new DataFrame_grouped class which inherits and dispatches its own show() (?).

lawremi commented 4 years ago

We should decouple persistent grouping from any particular API implementation (like dplyr). Grouping should probably go into a more general package.

hpages commented 4 years ago

Not sure why you need that but the general rule is that you should never replace an existing method with your own. If you really need to do this, you first need to extend the class so then you can override instead of replacing. However, extending DFrame means that mutate() would no longer act as an endomorphism i.e. it would take a DFrame and return a DataFrame_grouped. Would be kind of confusing e.g. why would some DataFrame be displayed as being dplyr-compatible and others not, when the truth is that all of them are?

jonocarroll commented 4 years ago

The original reason was that 'real' DataFrames weren't compatible - I tried to use the dplyr backend route and it created a list with a DataFrame sub-element, so I had to explicitly add support. That's no longer the case, so I don't really need the 'dplyr' compatible part any more.

For the grouping information... I'm interested to hear ideas for how that can be displayed.

jonocarroll commented 4 years ago

At the moment, following from dplyr's implementation, groups are literally just a table-like structure added as an attribute (in this case, on listData) listing the groupings and the data rows relevant to each combination, plus a set of functions to act on those.

lawremi commented 4 years ago

I know @sa-lee had to do this for GRanges. Ideally it would be more formal than metadata. We'd want to avoid having to reimplement it for every implementation of DataFrame though.

jonocarroll commented 4 years ago

Yeah, this all started with a chat with Stuart. The big trip-up with this current implementation is that it essentially splits the data according to the grouping, whereas GRanges can (IIRC) efficiently perform a calculation in a groupwise manner if that info can be passed on.

hpages commented 4 years ago

We'd want to avoid having to reimplement it for every implementation of DataFrame though.

The only way to achieve this is by using composition instead of inheritance. With the following implications:

  1. All DataFrame derivatives are expected to have the metadata slot (DataFrame derives from Vector) so that could be used. However if we want something more formal then that means we need to add a groups slot at the level of the DataFrame class (which will soon become VIRTUAL).
  2. The consequence of 1. is that support for groups becomes an S4Vectors business.
hpages commented 4 years ago

Another approach would be to explore how SplitDataFrame could fit here. It's the S4Vectors way of doing what tibble/dplyr does with grouped_df objects:

> a <- S4Vectors::DataFrame(tx_id=11:15, score=runif(5), gene_id=c(2L, 1L, 5L, 2L, 2L))
> split(a, a$gene_id)  # group by gene_id
SplitDataFrameList of length 3
$`1`
dplyr-compatible DataFrame with 1 row and 3 columns
      tx_id     score   gene_id
  <integer> <numeric> <integer>
1        12  0.236817         1

$`2`
dplyr-compatible DataFrame with 3 rows and 3 columns
      tx_id     score   gene_id
  <integer> <numeric> <integer>
1        11  0.905781         2
2        14  0.391706         2
3        15  0.732056         2

$`5`
dplyr-compatible DataFrame with 1 row and 3 columns
      tx_id     score   gene_id
  <integer> <numeric> <integer>
1        13  0.940257         5

However I suspect that some differences in the exact semantic of what "grouping" means/does in tibble/dplyr vs S4Vectors might get in the way. For example the former allows grouping by more than one variable, keeps track of them, and "remembers" the original order of the rows, whereas with a DataFrameSplit object you cannot go back to the original DataFrame at the moment (unless you know what grouping variable was used in which case you can unsplit()).

jonocarroll commented 4 years ago

If nothing else, I may be able to use this in the current group-aware verbs - I think I currently do the split manually. The advantage of that was preserving rownames through the split operation for e.g. group_by %>% filter.