Open stemangiola opened 11 months ago
group_split
is also a new useful function
I'm interested in having a go at this. I've looked at the dplyr
code and think I have a handle on it. Are there any implementations, especially part of tidyomics
that you'd prefer I use as a starting point?
Hello @B0ydT, great initiative.
This is probably the transcriptomics hardest challenge (the one for tidySE is even harder).
Currently, group_by
leads to a table in return, as I never approached the hard problem of making a proper API for this method.
This challenge needs a design document first, as we need to iron out all possible pitfalls of grouping an arbitrary set of columns and doing some operations on them. Some combinations may lead to valid single cell experiment objects, while others may lead to invalid objects, and therefore A tibble
must be returned.
tidySingleCellExperiment
https://tidytranscriptomics-workshops.github.io/bioc2022_tidytranscriptomics/articles/tidytranscriptomics_case_study.htmlLet's start from group_by
, imagine how many ways a user could group_by
across cell metadata and reduced dimensions, and how many ways it could operate on this object, e.g. mutate
, summarize
, something else.
Then pretty much you have creative freedom in proposing a design document, describing the logics of the group_by
API for SingleCellExperiment
I'm interested in having a go at this. I've looked at the
dplyr
code and think I have a handle on it. Are there any implementations, especially part oftidyomics
that you'd prefer I use as a starting point?
Any news? Let me know if you need help.
Hi @stemangiola. Apologies for the slow response, I've been in the middle of changing jobs. I certainly wasn't looking to jump on the thorniest issue of the lot; It took your comment for me to see how this could be quite a lot trickier than the dplyr
implementation.
I have tried to think through the issues you've raised. From what I remember of the dplyr
implementation, you define the groups when you group_by
, but most error detection occurs when you attempt to perform an operation on the grouped object. If we're starting with with_groups
as opposed to methods for each relevant function, I could catch those errors and provide informative feedback to the user. Otherwise, I'm very happy to help with implementation, but I'm not sure I'm the right person to work it through from first principles.
Hello @B0ydT, congrats for your new job!
You can start with group_split, it is useful and much easier.
A quick function for backend I found is
splitColData <- function(x, f) {
# This is by @jma1991
# at https://github.com/drisso/SingleCellExperiment/issues/55
i <- split(seq_along(f), f)
v <- vector(mode = "list", length = length(i))
names(v) <- names(i)
for (n in names(i)) { v[[n]] <- x[, i[[n]]] }
return(v)
}
Of course our interface will allow any of the cell-wise info, including reduced dimensions, etc..
Thank you very much! That looks great. I'll get it sorted shortly.
Stoked to have this merged. I'll repurpose for tidyseurat
and tidySE
in the next couple days. I'm also pretty confident I could get a decent implementation of with_groups
built on top of this. Thanks for all of your feedback, I've been learning heaps.
Just to follow up, the groups
column itself is preserved just fine (if .keep = TRUE
), but if I try and add a new column for logical comparisons it gets mangled. dplyr
adds a column for these comparisons, i.e. a > 5
tibble(a = 1:10) |> group_by(a>5)
# A tibble: 10 × 2
# Groups: a > 5 [2]
a `a > 5`
<int> <lgl>
1 1 FALSE
2 2 FALSE
3 3 FALSE
4 4 FALSE
5 5 FALSE
6 6 TRUE
7 7 TRUE
8 8 TRUE
9 9 TRUE
10 10 TRUE
That said, if you're fine with it I'm fine with it, just wanted to be clear.
Stoked to have this merged. I'll repurpose for
tidyseurat
andtidySE
in the next couple days.
Repurposing to tidyseurat
will take 3 minutes.
Repurposing to tidySE
might take 3 weeks.
For tidySE
, to avoid inefficiencies (some large pseudobulk might include 1 Billion rows), specific flows for samples-only queries and features-only queries must be designed, if the query is samples and features, we should decide what to do. For example, (1) it does not make sense, or (2) use the inefficient method with message, (3)...
For tidySE
look at nest, unnest, where I developed such strategies.
In fact you could get away using the nest quite optimised framework I developed. I would suggest doing sonmething like this in the backend
data |> mutate(...) |> nest(data_nested = -...) |> pull(data_nested)
Repurposing to
tidySE
might take 3 weeks.
Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.
Repurposing to
tidySE
might take 3 weeks.Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.
On the other hand, if you manage to pull off my trick could take 5 minutes :)
Hi @B0ydT, thanks for your great contribution! Please add your details to this authorship list if you would like to be included in our upcoming publication:
https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing
Thanks @william-hutchison, I will need to clear it with my supervisor but should complete it shortly.
@william-hutchison Have added my details, thanks again
@B0ydT great, thanks for your work!
the issue for tidyseurat is here
https://github.com/stemangiola/tidyseurat/issues/65