Closed m-noonan closed 7 months ago
I would recommend limma::diffsplice
for DTU if you have multiple samples.
For your cellxgene matrix and cellxtranscript, do they have the same cells? If the cells in cellxtranscript is a subset of the cells in the cellxgene matrix, you could use the combine_sce function to create a combined MultiAssayExperiment object. Otherwise you could make one yourself:
# example for when gene count matrix and transcript matrix have the same cells
# subset gene count matrix from example dataset to include only the same cells in transcript matrix
# sample dataset available in latest Github version: BiocManager::install("OliverVoogd/FLAMES")
scmixology_lib10_x <- scmixology_lib10[,colnames(scmixology_lib10_transcripts)]
sampleMap <- rbind(
DataFrame(
assay = "gene_counts", primary = colnames(scmixology_lib10_x),
colname = colnames(scmixology_lib10_x)
),
DataFrame(
assay = "transcript_counts",
primary = colnames(scmixology_lib10_transcripts), colname = colnames(scmixology_lib10_transcripts)
)
)
colLib <- DataFrame(row.names = colnames(scmixology_lib10_x))
colLib$Lib_small <- T
x <- MultiAssayExperiment::MultiAssayExperiment(list(
gene_counts = scmixology_lib10_x,
transcript_counts = scmixology_lib10_transcripts
), sampleMap = sampleMap, colData = colLib)
sc_umap_expression(multiAssay = x, gene = "ENSG00000108107")
Thank you so much for this tutorial, @ChangqingW! I ran this on my own data but got the following error, would you mind helping me troubleshoot? Here is info on my MultiAssayExperiment object and the error I got with the backtrace.
A MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] gene_counts: SingleCellExperiment with 23244 rows and 14974 columns
[2] transcript_counts: SingleCellExperiment with 48062 rows and 14974 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
$
, [
, [[
- extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
sc_umap_expression(multiAssay = combo, gene = "Slc34a1") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in
dplyr::filter()
: ℹ In argument:gene_id == gene
. Caused by error: ! object 'gene_id' not found Runrlang::last_trace()
to see where the error occurred. Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'
Backtrace: ▆
Also, could a previously generated UMAP projection/coordinates be added/used instead?
Ops, will fix this for this in the next release. You should be able to get around this by adding a gene_id
column to rowData
with rowData(experiments(combo)$transcript_counts)$gene_id <- rownames(experiments(combo)$transcript_counts)
.
previously generated UMAP projection would be used if PCA
and UMAP
is present in reducedDimNames(experiments(combo)$gene_counts)
and PCA
has the same dimension as n_pcs
.
I assume your gene counts and transcripts counts are from the exact same cells? We should be able to make a simpler wrapper for this scenario, not super proud with my current messy functions 😅
The removing 14970 sampleMap rows not in names(experiments)
message seem very concerning -- it is removing all your cells because the cell ids are not present in the sampleMap. Make sure you include samplemap and colData when creating your MultiAssayExperiment
. Both colnames
and primary
would be just your colnames in this scenario.
sampleMap <- rbind(
DataFrame(
assay = "gene_counts", primary = colnames(scmixology_lib10_x),
colname = colnames(scmixology_lib10_x)
),
DataFrame(
assay = "transcript_counts",
primary = colnames(scmixology_lib10_transcripts), colname = colnames(scmixology_lib10_transcripts)
)
)
colLib <- DataFrame(row.names = colnames(scmixology_lib10_x))
colLib$Lib_small <- T
So when I ran your example above it did a similar thing (removing 999 sampleMap rows, see below) and it didn't seem to affect the plotting.
sc_umap_expression(multiAssay = x, gene = "ENSG00000108107") Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 19 sampleMap rows with 'colname' not in colnames of experiments Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 999 sampleMap rows not in names(experiments) removing 19 colData rownames not in sampleMap 'primary' harmonizing input: removing 999 sampleMap rows not in names(experiments) removing 19 colData rownames not in sampleMap 'primary' harmonizing input: removing 980 sampleMap rows not in names(experiments) Constructing graphics... Plotting expression UMAPs ... Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata' 4: 'experiments' dropped; see 'metadata' 5: In (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : Ignoring unknown parameters:
group.selfish
Did you get your plots working?
Not yet, now I'm getting this error. I've tried a few other genes and it says the same thing for each one. What am I missing?
sc_umap_expression(multiAssay = combo, gene = "Emcn") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in sc_umap_expression(multiAssay = combo, gene = "Emcn") : Alternative isoform not found for the given gene: Emcn In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'
I forgot to mention that the transcript count SCE need to have the following rowData columns:
Hi @ChangqingW, I added those rowData columns (hopefully I did it right...) but it's still giving me the same error and is removing rows. Here is my code below:
sce_gene <- SingleCellExperiment(gene_counts) sce_gene <- SingleCellExperiment(list(counts=gene_counts))
sce_tx <- SingleCellExperiment(tx_counts) sce_tx <- SingleCellExperiment(list(counts=tx_counts))
rowData(sce_tx)$transcript_id <- rownames(tx_counts) rowData(sce_tx)$FSM_match <- rownames(tx_counts)
library(rtracklayer) gtf <- rtracklayer::import("/home/users/mnoonan/refdata-gex-mm10-2020-A/genes/genes.gtf") gtf.df = as.data.frame(gtf) tx2gene = unique(gtf.df[ ,c("transcript_id","gene_name")])
head(tx2gene) transcript_id gene_name 1
Xkr4 2 ENSMUST00000162897 Xkr4 5 ENSMUST00000159265 Xkr4 8 ENSMUST00000070533 Xkr4 19 Gm1992 20 ENSMUST00000161581 Gm1992
df <- as.data.frame(transcript_id) tx2gene merge <- merge(df, tx2gene, by = "transcript_id")
dim(merge) [1] 48062 2 head(merge, 10) transcript_id gene_name 1 ENSMUST00000000001 Gnai3 2 ENSMUST00000000010 Hoxb9 3 ENSMUST00000000028 Cdc45 4 ENSMUST00000000033 Igf2 5 ENSMUST00000000049 Apoh 6 ENSMUST00000000058 Cav2 7 ENSMUST00000000080 Klf6 8 ENSMUST00000000087 Scmh1 9 ENSMUST00000000090 Cox5a 10 ENSMUST00000000095 Tbx2
rowData(sce_tx)$gene_id <- merge$gene_name
sce_tx class: SingleCellExperiment dim: 48062 14974 metadata(0): assays(1): counts rownames(48062): ENSMUST00000000001 ENSMUST00000000010 ... ENSMUST00000239167 ENSMUST00000239168 rowData names(3): transcript_id FSM_match gene_id colnames(14974): AAACCCAAGGGAGGGT AAACCCAAGGTGCAGT ... TTTGTTGTCAGGAAGC TTTGTTGTCGTGAGAG colData names(0): reducedDimNames(0): mainExpName: NULL altExpNames(0):
sampleMapreal <- rbind( data.frame( assay = "gene_counts", primary = colnames(sce_gene), colname = colnames(sce_gene) ), data.frame( assay = "transcript_counts", primary = colnames(sce_tx), colname = colnames(sce_tx) ) )
colLibnew <- data.frame(row.names = colnames(sce_gene)) colLibnew$Lib_small <- T
library(MultiAssayExperiment) combo <- MultiAssayExperiment::MultiAssayExperiment(list( gene_counts = sce_gene, transcript_counts = sce_tx ), sampleMap = sampleMapreal, colData = colLibnew)
rowData(experiments(combo)$transcript_counts)$gene_id <- rownames(experiments(combo)$transcript_counts)
sc_umap_expression(multiAssay = combo, gene = "Xkr4") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in sc_umap_expression(multiAssay = combo, gene = "Xkr4") : Alternative isoform not found for the given gene: Xkr4 In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'
Could you try sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4]
?
It doesn't work, something to do with the brackets. I did both below and it said "unexpected '[' in"
sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4]
sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4])
Sorry, I meant the actual gene id of Xkr4, I think it is ENSG00000206579
, so
sc_umap_expression(multiAssay = combo, gene = ENSG00000206579)
Getting closer! It's at least able to find the gene (this is Slc34a1, mouse) and found 3 isoforms but now running into a different error:
sc_umap_expression(multiAssay = combo, gene = "ENSMUSG00000021490") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14971 sampleMap rows not in names(experiments) removing 1 colData rownames not in sampleMap 'primary' Only 3 isoforms found for this gene.
Error in plot_isoforms$layers[[length(plot_isoforms$layers)]] : attempt to select less than one element in integerOneIndex In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata' 4: 'experiments' dropped; see 'metadata' 5: In max(xlim) : no non-missing arguments to max; returning -Inf
This is MultiAssayExperiment
removing all your cells because the cell names are not present in the primary column of the sampleMap
.
Sorry, I think this error is actually due to this line
https://github.com/mritchielab/FLAMES/blob/44587e105a3ace42186fec0970a4b9ccd37f70b6/R/sc_annotate_plots.R#L289
which I use to hack ggbio's order of isoforms. Not sure why plot_isoforms$layers[[length(plot_isoforms$layers)]]
is causing error for you though.
Could be that rowRanges
was not provided for your transcript count SCE. The rowRanges
of the isoforms are passed to ggbio::autoplot
to make the isoform annotations. You could add them by loading your GTF with rtracklayer::import
and attaching them to the corresponding rows of your transcript count SCE (rowRanges(sce) <-
). More details
are on the Orchestrating Single-Cell Analysis with Bioconductor book chapter 4.2.4
Thanks @ChangqingW, that helped! But I'm missing ~200 gene ids somewhere. Do you know where this is referencing to?
sc_umap_expression(multiAssay = combo, gene = "ENSMUSG00000021490")
harmonizing input:
removing 4 sampleMap rows with 'colname' not in colnames of experiments
Running PCA for experiments(multiAssay)$gene_counts ...
harmonizing input:
removing 3 sampleMap rows with 'colname' not in colnames of experiments
removing 3 colData rownames not in sampleMap 'primary'
Running UMAP for experiments(multiAssay)$gene_counts ...
harmonizing input:
removing 14970 sampleMap rows not in names(experiments)
harmonizing input:
removing 14970 sampleMap rows not in names(experiments)
Error in dplyr::filter()
:
ℹ In argument: gene_id == gene
.
Caused by error:
! ..1
must be of size 48062 or 1, not size 47844.
Run rlang::last_trace()
to see where the error occurred.
Warning messages:
1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE.
2: 'experiments' dropped; see 'metadata'
3: 'experiments' dropped; see 'metadata'
I think we are almost there! You can identify the missing ones via is.na(rowData({your transcript SCE object})$gene_id)
, and if they are not relevant, you can throw them away via {your transcript SCE object} <- {your transcript SCE object}[!is.na(rowData({your transcript SCE object})$gene_id),]
Hi @ChangqingW, I tried using combine_sce of my gene and transcript SCE objects this way then did sc_umap_expression but got the "alternative isoform not found" error
combine <- combine_sce(short_read_large = multi_gene, short_read_small = multi_gene, long_read_sce = multi_tx) rowData(experiments(combine)$transcript_counts)$gene_id <- rownames(experiments(combine)$transcript_counts) sc_umap_expression(gene = "ENSMUSG00000024097", multiAssay = combine)
Running PCA for experiments(multiAssay)$gene_counts ... Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 5576 sampleMap rows not in names(experiments) removing 35 colData rownames not in sampleMap 'primary' harmonizing input: removing 5576 sampleMap rows not in names(experiments) removing 35 colData rownames not in sampleMap 'primary' Error in sc_umap_expression(gene = "ENSMUSG00000024097", multiAssay = combine) : Alternative isoform not found for the given gene: ENSMUSG00000024097 In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'
If I share my counts matrices with you, could you try it on your end? Let me know, thanks!
Wait! I just ran it without the rowData(experiments(combine)$transcript_counts)$gene_id <- rownames(experiments(combine)$transcript_counts) line and it gave me a plot! It's just transcript info and the isoforms look weird but it's a start! Thoughts on how to get the gene expression and correct isoforms?
Could you grep the relevant lines from your GTF / GFF file? It seems that the annotation does not contain exon information, or the isoforms are annotated as entire exons? The UMAP also seems weird -- the gray area are meant to denote cells not sampled for long-read sequencing in FLT-seq so this is probably a bug in the code, I am planning to make this function more protocol-agnostic but a bit busy at the moment.
I have a cellxgene count matrix, a cellxtranscript count matrix, and a whitelist of high-quality barcodes from ONT's EPI2ME Labs' wf-single-cell. Can this be used to create a SingleCellExperiment object compatible with sc_annotate_plots and sc_DTU_analysis functions in FLAMES?