stemangiola / tidySingleCellExperiment

Brings SingleCellExperiment objects to the tidyverse
https://stemangiola.github.io/tidySingleCellExperiment/index.html
34 stars 7 forks source link

Add conditional print method for objects that contain alternative experiments #89

Open Biomiha opened 10 months ago

Biomiha commented 10 months ago

Hi Stefano,

I've now made the printing conditional and have amended the show() unit test to assure that the default print method works. For the unit test I've added a random normcounts matrix to the existing pbmc_small sce object and named it ADT When printed to the console the header now looks like this:

   > pbmc_small
    `# A SingleCellExperiment-tibble abstraction: 80 × 17
    `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts

The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.

I'll start working on how to pull features from the alternative experiment slots into the tibble with join_features.

stemangiola commented 10 months ago

Hi Stefano,

I've now made the printing conditional and have amended the show() unit test to assure that the default print method works. For the unit test I've added a random normcounts matrix to the existing pbmc_small sce object and named it ADT When printed to the console the header now looks like this:

   > pbmc_small
    `# A SingleCellExperiment-tibble abstraction: 80 × 17
    `# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-normcounts

The print method takes any number of alternative experiments and appends the name to the assayNames contained therein, separated by a hyphen. I think this looks the tidiest. I hope you agree. If no alternative experiments exist, the print output is as before.

I'll start working on how to pull features from the alternative experiment slots into the tibble with join_features.

Amazing!

So if I can imagine a complex SCE with 3 expriments

we would have

`# A SingleCellExperiment-tibble abstraction: 80 × 17
`# Features=230 | Cells=80 | Assays=counts, logcounts, ADT-counts, ADT-logcounts, Hashtag demultiplex-counts, Hashtag demultiplex-logcounts

Am I interpreting correctly?

Biomiha commented 10 months ago

Yes, exactly. When I tested on a more complicated example I noticed that it creates a new line if the maximum line width is exceeded, but it still prints everything.

stemangiola commented 10 months ago

Yes, exactly.

Your solution looks great, and very intuitive, especially if experiment names are short.

Just to explore all possibilities, could you think about a solution that avoids the repetition of experiment names, so as to increase the scalability of our printout to many long experiment names?

Biomiha commented 10 months ago

Printing the name of the experiment instead, followed by the assayNames and separated by a semi-colon is easy enough. That would look something like:

     `# A SingleCellExperiment-tibble abstraction: 7,865 × 3
     `# Features=33538 | Cells=7865 | Assays=base: counts, logcounts; Antibody Capture: counts, logcounts; Hashtag demultiplex: counts, logcounts

However, like I said in my (edited) comment above, the native tibble print method will break lines if the character count gets too long (I can actually see a difference in how it's printed on my laptop vs. my wide-screen display). I think it's a matter of personal preference really. I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus. I also think we would need to find another name for base because it implies the base R vs. tidy R semantics, whereas here it's actually the main experiment and alternative experiments (and I find even this nomenclature pretty bad to be honest).

stemangiola commented 10 months ago

I find the colons and semi-colons a bit much and prefer the wider print but we can ask others and find a consensus.

I like the idea. Let's present both an "easy" and "hard" scenario to the community.

I also think we would need to find another name for base because it implies the base R vs. tidy R semantics

Agree.

I think slack Bioc tidy is the best place where to ask our community. But feel free to use any tool.

stemangiola commented 10 months ago

@Biomiha Thinking about it. You can leave it as is; maybe edit the example in the unit test to be the worst-case scenario. If it looks good in the worst-case scenario, there is no reason to change for the time being.

And instead, you can focus on the features such as join_feature, aggregate_cells, etc...

Biomiha commented 10 months ago

Okay, I've now mocked up the expression matrices for the additional alternative experiments and added that to the unit tests. In my hands it works as expected. Will move on to joining features next.

Biomiha commented 10 months ago

So, I think the join_features method is good to go. The way I envisaged it is that you can pick any feature for the long format and it automatically pulls the abundance for all the assays in that particular experiment. So if the features are from the ADT altExp it will just silently pull the values from all assays in the ADT altExp. The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons. For the wide shape, you obviously need to pick which assay to pull from and that makes things pretty straightforward. To avoid pulling assays with same name from different experiments I have chosen to use the same nomenclature that I used in the print method, e.g. logcounts, ADT-logcounts or HTO-logcounts. I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).

stemangiola commented 9 months ago

Well done!

The one thing I chose not to allow is the mixing of features from different experiments, which I think should be avoided for other reasons.

I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.

I have noticed however that there is a name.repair argument somewhere because at least for me features that have an illegal character get it changed to a "." in the wide format, for example my mock Ab features Ab-1 and Ab-2, etc... get changed to Ab.1, Ab.2, ... I think that could prove problematic because there are quite a few gene names that contain hyphens (like all mithchondrial genes).

Can you please make an example? And why this would be problematic?

Please also have a look to the conflicts.

Biomiha commented 9 months ago

I would think we can allow this. Imagine plotting the abundance of a transcript and the relative protein. Or is a gene has the same number for the transcript and the protein, we should pull both.

I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.

Can you please make an example? And why this would be problematic?

Example below:

library(tidySingleCellExperiment)
library(DropletTestFiles)
library(DropletUtils)
path <- getTestFile("tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz")
dir <- tempfile()
untar(path, exdir=dir)

sce <- read10xCounts(file.path(dir, "filtered_feature_bc_matrix"))
sce
sce <- splitAltExps(sce, rowData(sce)$Type)
rownames(sce) <- rowData(sce)$Symbol
sce |> 
  join_features(features = "MT-CO1", shape = "long")

sce |> 
  join_features(features = "MT-CO1", shape = "wide")

For me at least in the wide format I get:

.cell Sample Barcode MT.CO1

Notice the replacement of the - with the ..

Please also have a look to the conflicts.

Will do. Now that I am done with the backbone I will refactor the code to include the commits that were made in the interim.

Thanks

stemangiola commented 9 months ago

(

can you please provide this info for the authorship, if you haven't done already?

Email | Institution | First Name | Middle Name(s)/Initial(s) | Last Name | Suffix | Corresponding Author | Home Page URL | Collaborative Group/Consortium | ORCiD -- | -- | -- | -- | -- | -- | -- | -- | -- | -- )
stemangiola commented 9 months ago

I am still in two minds about it to be honest. I have now added the aggregate_cells function and have added the same check there as well. In general the one thing that needs to be assumed when pulling data not only from different experiments but also from different assays is that it will be normalised differently. In my experience users tend to just use whatever is in front of them so that could lead to misinterpretation. If you feel very strongly that you'd want to allow it then I can tweak the code to enable that.

A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.

Notice the replacement of the - with the ..

I believe is due to the Matrix of SCE that does not allow -. As we load the feature info there, I am not sure we can do much about that. Maybe print a warning message? So the user does not waste 30 minutes trying to debug this behaviour?

Biomiha commented 9 months ago

A good approach here is to see what tidyseurat does when we have multiomics, for example, with RNA and ADT. Would you be able to get an example dataset with ADT (e.g. azimuth PBMC I believe) and test your use cases on that? For example getting a transcript and a protein with the same query, or getting a feature ID that is the same for transcript and protein.

That's a good shout actually. Looking at it (please correct me if I am wrong), it seems that tidyseurat adds 2 columns called .abundance_RNA and .abundance_ADT and that seems to work fine (albeit with NAs everywhere) even when mixing features from different experiment types (or assays), whereas it seems to drop features from the ADT "assay" silently. If I generate the cbmc object from https://satijalab.org/seurat/articles/multimodal_vignette, running the following:

cbmc |> 
  join_features(features = c("A1BG-AS1", "CD3"), shape = "wide")

will only show A1BG-AS1.

Also in the tidyseurat print method the dash (-) is preserved whereas the output of

as.SingleCellExperiment(cbmc) |> 
  join_features(features = c("A1BG-AS1"), shape = "wide")

it is changed to a dot (.) . I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.

stemangiola commented 9 months ago

will only show A1BG-AS1.

That's definitely a bug :)

I don't actually think it is to do with the matrix colnames. When I stepped into the get_abundance_sc_wide function the output contained the dash (-) but when part of the join_features function it is changed to a dot.

Can you find which command is responsible?

I think for shape="long" we can copy what Seurat does, and as you proposed. NAs are fine. When shape = "wide" I believe we need to specify the assay, and we can encode (experiment + assay) the same way we print in the tibble.

Is anything else pending for discussion for the implementation?

Biomiha commented 9 months ago

Hi Stefano,

So I think I've managed to get the functions to work with any selected assay. I noticed that the aggregate_cells method had been refactored (yay!!) so I have used the new method and added the functionality to call assays from the altExps slots. Now the only thing that I think is outstanding is that when selecting all=TRUE in join_features it returns NULL. Not sure why.

To answer your question

Can you find which command is responsible?

It's the left_join here:

 suppressMessages({
      .data <-
        .data %>%
        left_join(                            # <= this one here
          by=c_(.data)$name,
          get_abundance_sc_long(
            .data=.data,
            features=features,
            all=all,
            exclude_zeros=exclude_zeros, 
            ...)) %>%
        select(!!c_(.data)$symbol, .feature,
               contains(".abundance"), everything())
    })
Biomiha commented 8 months ago

I have also made a stab at refactoring the aggregate_cells method. Was there a particular reason to turn the output into a SummarizedExperiment object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.

stemangiola commented 8 months ago

I have also made a stab at refactoring the aggregate_cells method. Was there a particular reason to turn the output into a SummarizedExperiment object? I can see that the equivalent method in the tidyseurat outputs a tibble instead.

I think it is elegant to switch from a single cell to a bulk container in Bioconductor, with the tidy interface bridging the two interfaces.

For tidyseurat, I don't think it is good to add SummarizedExperiment dependency, so it will not rely on bioconductor. However a seurat container could be used to store pseudobulk.

Biomiha commented 8 months ago

For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells function to output a SummarizedExperiment object. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.

stemangiola commented 8 months ago

For me personally it feels slightly counterintuitive to end up with a non-tidy output but I understand the concept and have amended the aggregate_cells function to output a SummarizedExperiment object. How would you like to handle the naming of the groups in cases where there are multiple selection columns (e.g. groups and ident)? For now the separator is "___" by default. Once we've finalised that I will update the unit test and let you know.

there are adapters for singleCellExperiment and summarizedExperiment, so we keep tidy all the way.

Biomiha commented 8 months ago

Do I need to use any other coercion method, other than SummarizedExperiment? Currently my output looks like this:

> df |> 
    aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))
$logcounts
class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(1): ''
rownames(230): MS4A1 CD79B ... SPON2 S100B
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):

$`ADT-logcounts`
class: SummarizedExperiment 
dim: 25 6 
metadata(0):
assays(1): ''
rownames(25): Ab-1 Ab-2 ... Ab-24 Ab-25
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(0):
stemangiola commented 8 months ago

Can a SummarizedExperiment have alternative experiment slot? If not ..

To keep consistent and always output a SummarizedExperiment rather than a list, I would use this strategy in case we have alternative experiments included

... |>
# Reshape to make RNA and ADT both features
        pivot_longer(
          cols = c(RNA, ADT), # dynamic assays
          names_to = "data_source",
          values_to = "count"
        ) |>
        filter(!count |> is.na()) |>

        # Preserve the original gene name for interpretation/dowanstream aalyses
       mutate(!!as.symbol(glue("{f_()$name}_original"))) :=  feature) |>

       # Make unique for SE container
       unite( !!f_()$symbol, c(feature, data_source), remove = FALSE) |>

        # Covert
        tidybulk::as_SummarizedExperiment(
          .sample = .sample, # these should be replaced with the dynamic gaming established in utilities.R
          .transcript = .feature,
          .abundance = count
        )

With a big message tidySingleCellExperiment says: ....

A user I will able to filter based on data_source == "RNA" for example

Biomiha commented 8 months ago

Do you think I'd need to change the tidybulk::as_SummarizedExperiment function to allow extra columns as well? Not necessarily something I want to do but I've noticed that it behaves strangely if multiple data sources are used. The function itself only takes 3 arguments, .sample, .transcript and .abundance. If I add the data_source it sometimes pushes it to the rowData slot but not every time and for some reason it only takes one type of data_source, whereas I can see in the tibble that I use as input into the tidybulk::as_SummarizedExperiment function that they are all there :confused: . The current commit should contain what you suggested but I am not 100% happy with its behaviour.

Biomiha commented 8 months ago

Just to elaborate, it seems that only when RNA and ADT are combined the output looks different (and not correct; the features are truncated to RNA features only) - last example below:

df |> 
  aggregate_cells(.sample = groups, assays = c("logcounts"))

class: SummarizedExperiment 
dim: 230 2 
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(2): g1 g2
colData names(2): groups data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = "logcounts")

class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(1): logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("counts", "logcounts"))

class: SummarizedExperiment 
dim: 230 6 
metadata(0):
assays(2): counts logcounts
rownames(230): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("ADT-counts", "ADT-logcounts"))

class: SummarizedExperiment 
dim: 25 6 
metadata(0):
assays(2): ADT-counts ADT-logcounts
rownames(25): Ab-1 Ab-10 ... Ab-8 Ab-9
rowData names(0):
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(3): groups ident data_source

df |> 
  aggregate_cells(.sample = c(groups, ident), assays = c("ADT-logcounts", "logcounts"))

class: SummarizedExperiment 
dim: 255 6 
metadata(0):
assays(2): logcounts ADT-logcounts
rownames(255): ACAP1 ACRBP ... ZNF330 ZNF76
rowData names(1): data_source
colnames(6): g1___0 g1___1 ... g2___1 g2___2
colData names(2): groups ident

Do you happen to know what could be causing this strange behaviour?

stemangiola commented 8 months ago

I am not sure I see the problem could you point that to me? 230 + 25 = 255

Biomiha commented 8 months ago

You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the data_source is in the rowData slot, whereas in the rest it's in the colData slot. I would almost favour everything being in the rowData slot instead. What do you think?

stemangiola commented 8 months ago

You're quite right, apologies, it doesn't truncate the features, it orders them somehow (?), hence why the ADT ones were hidden from the output (as well as in the head and tail, when I checked those). The one thing that still does not look consistent is that the last example in my list is the only one where the data_source is in the rowData slot, whereas in the rest it's in the colData slot. I would almost favour everything being in the rowData slot instead. What do you think?

colData includes info that are about samples, row data about genes. The dimensions should match with samples and genes, so its not about us to decide.

not sure about adt feature desapearing :(

Biomiha commented 8 months ago

Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :). I have now pushed all the changes (including the unit tests) to the altExp_methods branch. If you are happy with how it works I think it can be pulled into the master branch.

stemangiola commented 8 months ago

Well I suppose just the fact that it sticks the data_source column into either the colData or the rowData based on the dimensions to me would suggest it's not very robust :).

1) I would say, since you are adding the column internally, please add the column manually to rowData.

2) A small detail. Let's be very verbose in the messages (please tell me if I am misunderstanding)

"tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay names (stored in the column assay), to keep the rownames of the SingleCellExperiment unique. The original feature names are stored in the column XXX"

3) Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?

4) Sorry, I'm thinking, would it be better to call the data_source column assay?

Biomiha commented 8 months ago
  1. I would say, since you are adding the column internally, please add the column manually to rowData.

Done.

  1. A small detail. Let's be very verbose in the messages

Done. The warning now reads: tidySingleCellExperiment says: The selected assays have overlapping feature names. The feature names have been combined with the selected assay_type, to keep the rownames of the SingleCellExperiment unique. You can find the original feature names in the orig.feature.names column of the rowData slot of your object.

  1. Could you also confirm that this message is printed only when multiple Experiments & with overlapping gene names?

It is now :smiley: (I misunderstood what you meant before).

  1. Sorry, I'm thinking, would it be better to call the data_source column assay?

I have changed it to assay_type. I think the name is self-explanatory and still distinct from the word assay, which in SingleCellExperiment world refers to counts, logcounts, etc... i.e. even data transformations.

stemangiola commented 8 months ago

Amazing, we are getting close.

I submitted a review with comments. After this, only two thing will remain

1) I will review the code more generally for robustness and style 2) You will have to pass the unit tests (I believe they are failing at the moment). Are they gaining locally for you as well?

Biomiha commented 8 months ago

I think in general we might want to slowly start replacing the magrittr pipe %>% with the base pipe |>. I have left them as they were but worth considering refactoring that to future proof the code. All unit tests pass on my local machine. Are the unit tests failing on your machine?

stemangiola commented 8 months ago

ts pass on my local machi

I agree, but it's good to do it in another pull request, so we leave this readable with less changes as possible.

You can test unit tests on your github. Better do see there than locally.

[EDIT]

I had a look at your fork, and the github actions do not start, that is a pity. Not sure why.

Biomiha commented 8 months ago

I've now added additional comments to the code to make it hopefully clearer. In relation to the unit tests, I've never used github actions myself. If you install the package from my branch (I use: pak::pkg_install("Biomiha/tidySingleCellExperiment@altExp_methods")) do the unit tests work for you?

stemangiola commented 8 months ago

Amazing @Biomiha, this is huge PR! I was looking at the code, I will have quite a few questions.

But for now, can you see the check section on this web page?

image

If you click on the crosses you will see why the tests did not pass.

Biomiha commented 8 months ago

It looks like it can't build the vignette because it's missing the magick package it needs to crop some images. I can have a look but I would hazard a guess this is a separate issue.

Biomiha commented 4 months ago

I've now consolidated the differences since the two repos were diverging. For RNA only experiments it should produce the same results and for experiments with CITE-seq / Abseq data it should seamlessly add them to the output. If you don't mind I was going to ask the hive mind to kick the tyres a bit and test it on their data to see if there any edge cases that I have not accounted for. On an unrelated note - I have noticed that in cases where the parent sce object has multiple reducedDims and a lot of colData entries, it takes an exceedingly long time to even print the tidySCE object. Worth considering how to speed this up because I expect users might be turned away by this.

stemangiola commented 4 months ago

The first thing, is to see if your PR passes unit tests (the master branch does). Then, I will review the code to ensure it keeps a consistent style with the main repo (I might ask to change few things in case).

If you don't mind I was going to ask the hive mind to kick the tyres a bit and test it on their data to see if there any edge cases that I have not accounted for.

Sure go ahead!

On an unrelated note - I have noticed that in cases where the parent sce object has multiple reducedDims and a lot of colData entries, it takes an exceedingly long time to even print the tidySCE object. Worth considering how to speed this up because I expect users might be turned away by this.

Yes this effort is part of the challenges. But nobody has taken them. We need to optimise tidySCE and tidySE.