Implement `group_by` and `with_groups` for `tidySingleCellExperiment` and `tidyseurat`

stemangiola commented 11 months ago

The issue for tidyseurat is here:

https://github.com/stemangiola/tidyseurat/issues/65

stemangiola commented 11 months ago

group_split would also be a useful new function.

B0ydT commented 11 months ago

I'm interested in having a go at this. I've looked at the dplyr code and think I have a handle on it. Are there any implementations, especially ones that are part of tidyomics, that you'd prefer I use as a starting point?

stemangiola commented 11 months ago

Hello @B0ydT, great initiative.

This is probably the hardest of the transcriptomics challenges (the one for tidySE is even harder).

Currently, group_by simply returns a table, as I have never tackled the hard problem of designing a proper API for this method.

This challenge needs a design document first, as we need to iron out all the possible pitfalls of grouping by an arbitrary set of columns and then operating on the groups. Some combinations may lead to valid SingleCellExperiment objects, while others may lead to invalid objects, in which case a tibble must be returned.

Have a look at this workshop to learn more about tidySingleCellExperiment: https://tidytranscriptomics-workshops.github.io/bioc2022_tidytranscriptomics/articles/tidytranscriptomics_case_study.html

Let's start with group_by: imagine how many ways a user could group by cell metadata and reduced dimensions, and how many ways they could then operate on the grouped object, e.g. mutate, summarize, or something else.

After that, you pretty much have creative freedom in proposing a design document that describes the logic of the group_by API for SingleCellExperiment.

stemangiola commented 10 months ago

> I'm interested in having a go at this. I've looked at the dplyr code and think I have a handle on it. Are there any implementations, especially part of tidyomics that you'd prefer I use as a starting point?

Any news? Let me know if you need help.

B0ydT commented 9 months ago

Hi @stemangiola. Apologies for the slow response, I've been in the middle of changing jobs. I certainly wasn't looking to jump on the thorniest issue of the lot; it took your comment for me to see how this could be quite a lot trickier than the dplyr implementation.

I have tried to think through the issues you've raised. From what I remember of the dplyr implementation, you define the groups when you group_by, but most error detection occurs when you attempt to perform an operation on the grouped object. If we're starting with with_groups as opposed to methods for each relevant function, I could catch those errors and provide informative feedback to the user. Otherwise, I'm very happy to help with implementation, but I'm not sure I'm the right person to work it through from first principles.
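For reference, the dplyr behaviour described above can be sketched on a plain tibble. This is only an illustration under assumptions: the `cells` table and its column names are made up, and nothing here touches SingleCellExperiment. It shows that `group_by()` merely records the grouping, that `with_groups()` groups, applies one function and ungroups, and that an error only appears once an operation runs, which is where it could be caught and reported.

```r
library(dplyr)

# Toy cell-level table; the column names are illustrative assumptions
cells <- tibble(
  cell    = paste0("cell_", 1:6),
  cluster = c("a", "a", "b", "b", "b", "c"),
  n_umi   = c(10, 20, 30, 40, 50, 60)
)

# group_by() only records the grouping; nothing is computed yet
grouped <- group_by(cells, cluster)

# with_groups() groups, applies a single function, then ungroups
centred <- with_groups(cells, cluster, mutate,
                       n_umi_centred = n_umi - mean(n_umi))

# A failure only surfaces when an operation runs on the groups,
# so this is the point where it can be caught and reported
msg <- tryCatch(
  {
    with_groups(cells, cluster, mutate, bad = no_such_column + 1)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
```

Wrapping the call in `tryCatch()` is just one possible way to turn a low-level error into informative feedback; a real method might re-signal it with a package-specific message.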

stemangiola commented 9 months ago

Hello @B0ydT, congrats on your new job!

You can start with group_split, it is useful and much easier.

A quick function for the backend that I found is:

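splitColData <- function(x, f) {
  # This is by @jma1991
  # at https://github.com/drisso/SingleCellExperiment/issues/55
  # x: object subset by column; f: grouping vector, one entry per column

  # Column indices belonging to each level of f
  i <- split(seq_along(f), f)

  # Pre-allocate a named list with one slot per group
  v <- vector(mode = "list", length = length(i))
  names(v) <- names(i)

  # Subset the columns of x group by group
  for (n in names(i)) { v[[n]] <- x[, i[[n]]] }

  return(v)
}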

Of course, our interface will allow any of the cell-wise info, including reduced dimensions, etc.
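To make the column-splitting idea concrete, here is a minimal sketch on a plain matrix standing in for a single-cell object. The helper name `split_columns`, the toy `counts` matrix and the `cluster` vector are all hypothetical; whether the same `drop = FALSE` call behaves identically on a SingleCellExperiment is an assumption not checked here.

```r
# Hypothetical helper: same idea as splitColData(), written with lapply()
split_columns <- function(x, f) {
  # Column positions belonging to each level of f
  idx <- split(seq_along(f), f)
  # One column subset per group; drop = FALSE keeps single columns as matrices
  lapply(idx, function(j) x[, j, drop = FALSE])
}

# Toy stand-in for an assay: 2 genes x 6 cells
counts <- matrix(
  1:12, nrow = 2,
  dimnames = list(c("gene1", "gene2"), paste0("cell", 1:6))
)

# One (made-up) cluster label per cell
cluster <- c("a", "b", "a", "b", "b", "a")

parts <- split_columns(counts, cluster)
```

Here `parts` is a named list with one element per cluster label, each holding only the columns (cells) assigned to that label.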

B0ydT commented 9 months ago

Thank you very much! That looks great. I'll get it sorted shortly.

B0ydT commented 6 months ago

Stoked to have this merged. I'll repurpose for tidyseurat and tidySE in the next couple days. I'm also pretty confident I could get a decent implementation of with_groups built on top of this. Thanks for all of your feedback, I've been learning heaps.

Just to follow up: the grouping column itself is preserved just fine (if .keep = TRUE), but if I try to add a new column for a logical comparison, it gets mangled. dplyr adds a column for such comparisons, e.g. a > 5:

tibble(a = 1:10) |> group_by(a>5)
# A tibble: 10 × 2
# Groups:   a > 5 [2]
       a `a > 5`
   <int> <lgl>  
 1     1 FALSE  
 2     2 FALSE  
 3     3 FALSE  
 4     4 FALSE  
 5     5 FALSE  
 6     6 TRUE   
 7     7 TRUE   
 8     8 TRUE   
 9     9 TRUE   
10    10 TRUE

That said, if you're fine with it I'm fine with it, just wanted to be clear.
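For comparison, this is how dplyr's own `group_split()` treats a computed grouping expression on a plain tibble. It is a sketch of the dplyr behaviour only, not of the merged tidySingleCellExperiment method; the object names are illustrative.

```r
library(dplyr)

df <- tibble(a = 1:10)

# With the default .keep = TRUE, each piece carries the computed column
kept <- group_split(df, a > 5)

# With .keep = FALSE, the computed grouping column is dropped from the pieces
dropped <- group_split(df, a > 5, .keep = FALSE)

# Column names of the first piece under each setting
names_kept    <- names(kept[[1]])
names_dropped <- names(dropped[[1]])
```

Both calls return a list of two tibbles, one for the rows where the comparison is FALSE and one for the rows where it is TRUE.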

stemangiola commented 6 months ago

> Stoked to have this merged. I'll repurpose for tidyseurat and tidySE in the next couple days.

Repurposing to tidyseurat will take 3 minutes.

Repurposing to tidySE might take 3 weeks.

For tidySE, to avoid inefficiencies (a large pseudobulk dataset might include 1 billion rows), specific flows must be designed for samples-only queries and for features-only queries. If a query involves both samples and features, we should decide what to do: for example, (1) reject it as not making sense, (2) use the inefficient method with a message, or (3)...

For tidySE, look at nest and unnest, where I developed such strategies.

In fact, you could get away with using the quite optimised nest framework I developed. I would suggest doing something like this in the backend:


data |> mutate(...) |> nest(data_nested = -...) |> pull(data_nested)
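The shape of that pipeline can be tried out on an ordinary tibble with dplyr and tidyr. This is only a sketch under assumptions: the toy `data` table and the helper column `is_high` are made up, and the tidySE-specific nest method is not exercised here.

```r
library(dplyr)
library(tidyr)

# Toy table standing in for the tidy view of an experiment
data <- tibble(a = 1:10, b = letters[1:10])

pieces <- data |>
  # Materialise the grouping expression as a (hypothetical) helper column
  mutate(is_high = a > 5) |>
  # Nest everything except the helper column: one row per group
  nest(data_nested = -is_high) |>
  # Extract the list of per-group tables
  pull(data_nested)
```

Note that the nested pieces do not contain the helper column, so this corresponds to the `.keep = FALSE` flavour of `group_split()`; keeping the column would need an extra step.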

B0ydT commented 6 months ago

> Repurposing to tidySE might take 3 weeks.

Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.

stemangiola commented 6 months ago

> > Repurposing to tidySE might take 3 weeks.
>
> Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.

On the other hand, if you manage to pull off my trick, it could take 5 minutes :)

william-hutchison commented 6 months ago

Hi @B0ydT, thanks for your great contribution! Please add your details to this authorship list if you would like to be included in our upcoming publication:

https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing

B0ydT commented 6 months ago

Thanks @william-hutchison, I will need to clear it with my supervisor but should complete it shortly.

B0ydT commented 6 months ago

@william-hutchison Have added my details, thanks again

william-hutchison commented 6 months ago

@B0ydT great, thanks for your work!

stemangiola / tidySingleCellExperiment

Implement `group_by` and `with_groups` for `tidySingleCellExperiment` and `tidyseurat` #71