Open Biomiha opened 10 months ago
Hi Stefano,
I've now made the printing conditional and have amended the
show()
unit test to assure that the default print method works. For the unit test I've added a randomnormcounts
matrix to the existingpbmc_small
sce object and named itADT
When printed to the console the header now looks like this:> pbmc_small `# A SingleCellExperiment-tibble abstraction: 80 × 17 `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts
The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.
I'll start working on how to pull features from the alternative experiment slots into the tibble with
join_features
.
Amazing!
So if I can imagine a complex SCE with 3 expriments
we would have
`# A SingleCellExperiment-tibble abstraction: 80 × 17
`# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-counts, ADT-logcounts, Hashtag demultiplex-counts, Hashtag demultiplex-logcounts
Am I interpreting correctly?
Yes, exactly. When I tested on a more complicated example I noticed that it creates a new line if the maximum line width is exceeded, but it still prints everything.
Yes, exactly.
Your solution looks great, and very intuitive, especially if experiment names are short.
Just to explore all possibilities, could you think about a solution that avoids the repetition of experiment names, so as to increase the scalability of our printout to many long experiment names?
Printing the name of the experiment instead, followed by the assayNames and separated by a semi-colon is easy enough. That would look something like:
`# A SingleCellExperiment-tibble abstraction: 7,865 × 3
`# Features=33538 | Cells=7865 | Assays=base: counts, logcounts; Antibody Capture: counts, logcounts; Hashtag demultiplex: counts, logcounts
However, like I said in my (edited) comment above, the native tibble print method will break lines if the character count gets too long (I can actually see a difference in how it's printed on my laptop vs. my wide-screen display).
I think it's a matter of personal preference really. I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus. I also think we would need to find another name for base
because it implies the base R vs. tidy R semantics, whereas here it's actually the main experiment and alternative experiments (and I find even this nomenclature pretty bad to be honest).
I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus.
I like the idea. Let's present both an "easy" and "hard" scenario to the community.
I also think we would need to find another name for
base
because it implies the base R vs. tidy R semantics
Agree.
I think slack Bioc tidy is the best place where to ask our community. But feel free to use any tool.
@Biomiha Thinking about it. You can leave it as is; maybe edit the example in the unit test to be the worst-case scenario. If it looks good in the worst-case scenario, there is no reason to change for the time being.
And instead, you can focus on the features such as join_feature
, aggregate_cells
, etc...
Okay, I've now mocked up the expression matrices for the additional alternative experiments and added that to the unit tests. In my hands it works as expected. Will move on to joining features next.
So, I think the join_features
method is good to go. The way I envisaged it is that you can pick any feature for the long format and it automatically pulls the abundance for all the assays in that particular experiment. So if the features are from the ADT altExp it will just silently pull the values from all assays in the ADT altExp. The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.
For the wide shape, you obviously need to pick which assay to pull from and that makes things pretty straightforward. To avoid pulling assays with same name from different experiments I have chosen to use the same nomenclature that I used in the print method, e.g. logcounts
, ADT-logcounts
or HTO-logcounts
.
I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).
Well done!
The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.
I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.
I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).
Can you please make an example? And why this would be problematic?
Please also have a look to the conflicts.
I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.
I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.
Can you please make an example? And why this would be problematic?
Example below:
library(tidySingleCellExperiment)
library(DropletTestFiles)
library(DropletUtils)
path <- getTestFile("tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz")
dir <- tempfile()
untar(path, exdir=dir)
sce <- read10xCounts(file.path(dir, "filtered_feature_bc_matrix"))
sce
sce <- splitAltExps(sce, rowData(sce)$Type)
rownames(sce) <- rowData(sce)$Symbol
sce |>
join_features(features = "MT-CO1", shape = "long")
sce |>
join_features(features = "MT-CO1", shape = "wide")
For me at least in the wide format I get:
.cell Sample Barcode MT.CO1
Notice the replacement of the -
with the .
.
Please also have a look to the conflicts.
Will do. Now that I am done with the backbone I will refactor the code to include the commits that were made in the interim.
Thanks
(
can you please provide this info for the authorship, if you haven't done already?
I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.
A good approach here is to see what tidyseurat
does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.
Notice the replacement of the
-
with the.
.
I believe is due to the Matrix of SCE that does not allow -
. As we load the feature info there, I am not sure we can do much about that. Maybe print a warning message? So the user does not waste 30 minutes trying to debug this behaviour?
A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.
That's a good shout actually. Looking at it (please correct me if I am wrong), it seems that tidyseurat adds 2 columns called .abundance_RNA
and .abundance_ADT
and that seems to work fine (albeit with NAs everywhere) even when mixing features from different experiment types (or assays), whereas it seems to drop features from the ADT "assay" silently.
If I generate the cbmc
object from https://satijalab.org/seurat/articles/multimodal_vignette, running the following:
cbmc |>
join_features(features = c("A1BG-AS1", "CD3"), shape = "wide")
will only show A1BG-AS1
.
Also in the tidyseurat print method the dash (-) is preserved whereas the output of
as.SingleCellExperiment(cbmc) |>
join_features(features = c("A1BG-AS1"), shape = "wide")
it is changed to a dot (.) .
I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide
function the output contained the dash (-) but when part of the join_features
function it is changed to a dot.
will only show
A1BG-AS1
.
That's definitely a bug :)
I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.
Can you find which command is responsible?
I think for shape="long"
we can copy what Seurat does, and as you proposed. NAs are fine. When shape = "wide"
I believe we need to specify the assay, and we can encode (experiment + assay) the same way we print in the tibble.
Is anything else pending for discussion for the implementation?
Hi Stefano,
So I think I've managed to get the functions to work with any selected assay. I noticed that the aggregate_cells
method had been refactored (yay!!) so I have used the new method and added the functionality to call assays from the altExps
slots.
Now the only thing that I think is outstanding is that when selecting all=TRUE
in join_features
it returns NULL. Not sure why.
To answer your question
Can you find which command is responsible?
It's the left_join here:
suppressMessages({
.data <-
.data %>%
left_join( # <= this one here
by=c_(.data)$name,
get_abundance_sc_long(
.data=.data,
features=features,
all=all,
exclude_zeros=exclude_zeros,
...)) %>%
select(!!c_(.data)$symbol, .feature,
contains(".abundance"), everything())
})
I have also made a stab at refactoring the aggregate_cells
method. Was there a particular reason to turn the output into a SummarizedExperiment
object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.
I have also made a stab at refactoring the
aggregate_cells
method. Was there a particular reason to turn the output into aSummarizedExperiment
object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.
I think it is elegant to switch from a single cell to a bulk container in Bioconductor, with the tidy interface bridging the two interfaces.
For tidyseurat, I don't think it is good to add SummarizedExperiment dependency, so it will not rely on bioconductor. However a seurat container could be used to store pseudobulk.
For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells
function to output a SummarizedExperiment
object.
How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.
For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the
aggregate_cells
function to output aSummarizedExperiment
object. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.
there are adapters for singleCellExperiment and summarizedExperiment, so we keep tidy all the way.
Do I need to use any other coercion method, other than SummarizedExperiment
? Currently my output looks like this:
> df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
$logcounts
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(1): ''
rownames(230): MS4A1 CD79B ... SPON2 S100B
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):
$`ADT-logcounts`
class: SummarizedExperiment
dim: 25 6
metadata(0):
assays(1): ''
rownames(25): Ab-1 Ab-2 ... Ab-24 Ab-25
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):
Can a SummarizedExperiment
have alternative experiment slot? If not ..
To keep consistent and always output a SummarizedExperiment
rather than a list
, I would use this strategy in case we have alternative experiments included
... |>
# Reshape to make RNA and ADT both features
pivot_longer(
cols = c(RNA, ADT), # dynamic assays
names_to = "data_source",
values_to = "count"
) |>
filter(!count |> is.na()) |>
# Preserve the original gene name for interpretation/dowanstream aalyses
mutate(!!as.symbol(glue("{f_()$name}_original"))) := feature) |>
# Make unique for SE container
unite( !!f_()$symbol, c(feature, data_source), remove = FALSE) |>
# Covert
tidybulk::as_SummarizedExperiment(
.sample = .sample, # these should be replaced with the dynamic gaming established in utilities.R
.transcript = .feature,
.abundance = count
)
With a big message tidySingleCellExperiment says: ....
A user I will able to filter based on data_source
== "RNA" for example
Do you think I'd need to change the tidybulk::as_SummarizedExperiment
function to allow extra columns as well? Not necessarily something I want to do but I've noticed that it behaves strangely if multiple data sources are used. The function itself only takes 3 arguments, .sample, .transcript and .abundance. If I add the data_source it sometimes pushes it to the rowData slot but not every time and for some reason it only takes one type of data_source, whereas I can see in the tibble that I use as input into the tidybulk::as_SummarizedExperiment
function that they are all there :confused: .
The current commit should contain what you suggested but I am not 100% happy with its behaviour.
Just to elaborate, it seems that only when RNA and ADT are combined the output looks different (and not correct; the features are truncated to RNA features only) - last example below:
df |>
aggregate_cells(.sample = groups, assays = c("logcounts"))
class: SummarizedExperiment
dim: 230 2
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(2): g1 g2
colData names(2): groups data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = "logcounts")
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("counts", "logcounts"))
class: SummarizedExperiment
dim: 230 6
metadata(0):
assays(2): counts logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-counts", "ADT-logcounts"))
class: SummarizedExperiment
dim: 25 6
metadata(0):
assays(2): ADT-counts ADT-logcounts
rownames(25): Ab-1 Ab-10 ... Ab-8 Ab-9
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source
df |>
aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
class: SummarizedExperiment
dim: 255 6
metadata(0):
assays(2): logcounts ADT-logcounts
rownames(255): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(1): data_source
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(2): groups ident
Do you happen to know what could be causing this strange behaviour?
I am not sure I see the problem could you point that to me? 230 + 25 = 255
You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those).
The one thing that still does not look consistent is that the last example in my list is the only one where the data_source
is in the rowData
slot, whereas in the rest it's in the colData
slot. I would almost favour everything being in the rowData
slot instead. What do you think?
You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the
data_source
is in therowData
slot, whereas in the rest it's in thecolData
slot. I would almost favour everything being in therowData
slot instead. What do you think?
colData includes info that are about samples, row data about genes. The dimensions should match with samples and genes, so its not about us to decide.
not sure about adt feature desapearing :(
Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :). I have now pushed all the changes (including the unit tests) to the altExp_methods branch. If you are happy with how it works I think it can be pulled into the master branch.
Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :).
1) I would say, since you are adding the column internally, please add the column manually to rowData
.
2) A small detail. Let's be very verbose in the messages (please tell me if I am misunderstanding)
"tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay names (stored in the column assay
), to keep the rownames of the SingleCellExperiment unique. The original feature names are stored in the column XXX
"
3) Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?
4) Sorry, I'm thinking, would it be better to call the data_source
column assay
?
- I would say, since you are adding the column internally, please add the column manually to rowData.
Done.
- A small detail. Let's be very verbose in the messages
Done. The warning now reads: tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay_type, to keep the rownames of the SingleCellExperiment unique. You can find the original feature names in the orig.feature.names column of the rowData slot of your object.
- Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?
It is now :smiley: (I misunderstood what you meant before).
- Sorry, I'm thinking, would it be better to call the data_source column assay?
I have changed it to assay_type
. I think the name is self-explanatory and still distinct from the word assay
, which in SingleCellExperiment
world refers to counts, logcounts, etc... i.e. even data transformations.
Amazing, we are getting close.
I submitted a review with comments. After this, only two thing will remain
1) I will review the code more generally for robustness and style 2) You will have to pass the unit tests (I believe they are failing at the moment). Are they gaining locally for you as well?
I think in general we might want to slowly start replacing the magrittr pipe %>%
with the base pipe |>
. I have left them as they were but worth considering refactoring that to future proof the code.
All unit tests pass on my local machine. Are the unit tests failing on your machine?
ts pass on my local machi
I agree, but it's good to do it in another pull request, so we leave this readable with less changes as possible.
You can test unit tests on your github. Better do see there than locally.
[EDIT]
I had a look at your fork, and the github actions do not start, that is a pity. Not sure why.
I've now added additional comments to the code to make it hopefully clearer.
In relation to the unit tests, I've never used github actions myself. If you install the package from my branch (I use: pak::pkg_install("Biomiha/tidySingleCellExperiment@altExp_methods")
) do the unit tests work for you?
Amazing @Biomiha, this is huge PR! I was looking at the code, I will have quite a few questions.
But for now, can you see the check section on this web page?
If you click on the crosses you will see why the tests did not pass.
It looks like it can't build the vignette because it's missing the magick package it needs to crop some images. I can have a look but I would hazard a guess this is a separate issue.
I've now consolidated the differences since the two repos were diverging. For RNA only experiments it should produce the same results and for experiments with CITE-seq / Abseq data it should seamlessly add them to the output.
If you don't mind I was going to ask the hive mind to kick the tyres a bit and test it on their data to see if there any edge cases that I have not accounted for.
On an unrelated note - I have noticed that in cases where the parent sce
object has multiple reducedDims
and a lot of colData
entries, it takes an exceedingly long time to even print the tidySCE object. Worth considering how to speed this up because I expect users might be turned away by this.
The first thing, is to see if your PR passes unit tests (the master branch does). Then, I will review the code to ensure it keeps a consistent style with the main repo (I might ask to change few things in case).
If you don't mind I was going to ask the hive mind to kick the tyres a bit and test it on their data to see if there any edge cases that I have not accounted for.
Sure go ahead!
On an unrelated note - I have noticed that in cases where the parent
sce
object has multiplereducedDims
and a lot ofcolData
entries, it takes an exceedingly long time to even print the tidySCE object. Worth considering how to speed this up because I expect users might be turned away by this.
Yes this effort is part of the challenges. But nobody has taken them. We need to optimise tidySCE and tidySE.
Hi Stefano,
I've now made the printing conditional and have amended the
show()
unit test to assure that the default print method works. For the unit test I've added a randomnormcounts
matrix to the existingpbmc_small
sce object and named itADT
When printed to the console the header now looks like this:The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.
I'll start working on how to pull features from the alternative experiment slots into the tibble with
join_features
.