mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0
20 stars 9 forks source link

use with ONT workflow? #16

Closed m-noonan closed 7 months ago

m-noonan commented 1 year ago

I have a cellxgene count matrix, a cellxtranscript count matrix, and a whitelist of high-quality barcodes from ONT's EPI2ME Labs' wf-single-cell. Can this be used to create a SingleCellExperiment object compatible with sc_annotate_plots and sc_DTU_analysis functions in FLAMES?

ChangqingW commented 1 year ago

I would recommend limma::diffsplice for DTU if you have multiple samples. For your cellxgene matrix and cellxtranscript, do they have the same cells? If the cells in cellxtranscript is a subset of the cells in the cellxgene matrix, you could use the combine_sce function to create a combined MultiAssayExperiment object. Otherwise you could make one yourself:

# example for when gene count matrix and transcript matrix have the same cells
# subset gene count matrix from example dataset to include only the same cells in transcript matrix
# sample dataset available in latest Github version: BiocManager::install("OliverVoogd/FLAMES")
scmixology_lib10_x <- scmixology_lib10[,colnames(scmixology_lib10_transcripts)]
sampleMap <- rbind(
  DataFrame(
    assay = "gene_counts", primary = colnames(scmixology_lib10_x),
    colname = colnames(scmixology_lib10_x)
  ),
  DataFrame(
    assay = "transcript_counts",
    primary = colnames(scmixology_lib10_transcripts), colname = colnames(scmixology_lib10_transcripts)
  )
)
colLib <- DataFrame(row.names = colnames(scmixology_lib10_x))
colLib$Lib_small <- T

x <- MultiAssayExperiment::MultiAssayExperiment(list(
  gene_counts = scmixology_lib10_x,
  transcript_counts = scmixology_lib10_transcripts
), sampleMap = sampleMap, colData = colLib)

sc_umap_expression(multiAssay = x, gene = "ENSG00000108107")

image

m-noonan commented 1 year ago

Thank you so much for this tutorial, @ChangqingW! I ran this on my own data but got the following error, would you mind helping me troubleshoot? Here is info on my MultiAssayExperiment object and the error I got with the backtrace.

A MultiAssayExperiment object of 2 listed experiments with user-defined names and respective classes. Containing an ExperimentList class object of length 2: [1] gene_counts: SingleCellExperiment with 23244 rows and 14974 columns [2] transcript_counts: SingleCellExperiment with 48062 rows and 14974 columns Functionality: experiments() - obtain the ExperimentList instance colData() - the primary/phenotype DataFrame sampleMap() - the sample coordination DataFrame $, [, [[ - extract colData columns, subset, or experiment *Format() - convert into a long or wide DataFrame assays() - convert ExperimentList to a SimpleList of matrices exportClass() - save data to flat files

sc_umap_expression(multiAssay = combo, gene = "Slc34a1") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in dplyr::filter(): ℹ In argument: gene_id == gene. Caused by error: ! object 'gene_id' not found Run rlang::last_trace() to see where the error occurred. Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'

Backtrace: ▆

  1. ├─FLAMES::sc_umap_expression(multiAssay = combo, gene = "Slc34a1")
  2. │ └─row_meta %>% dplyr::filter(gene_id == gene) %>% ...
  3. ├─dplyr::slice_max(., n = n_isoforms, order_by = mean)
  4. ├─dplyr::filter(., gene_id == gene)
  5. ├─dplyr:::filter.data.frame(., gene_id == gene)
  6. │ └─dplyr:::filter_rows(.data, dots, by)
  7. │ └─dplyr:::filter_eval(...)
  8. │ ├─base::withCallingHandlers(...)
  9. │ └─mask$eval_all_filter(dots, env_filter)
    1. │ └─dplyr (local) eval()
    2. └─base::.handleSimpleError(...)
    3. └─dplyr (local) h(simpleError(msg, call))
    4. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
m-noonan commented 1 year ago

Also, could a previously generated UMAP projection/coordinates be added/used instead?

ChangqingW commented 1 year ago

Ops, will fix this for this in the next release. You should be able to get around this by adding a gene_id column to rowData with rowData(experiments(combo)$transcript_counts)$gene_id <- rownames(experiments(combo)$transcript_counts).

previously generated UMAP projection would be used if PCA and UMAP is present in reducedDimNames(experiments(combo)$gene_counts) and PCA has the same dimension as n_pcs.

I assume your gene counts and transcripts counts are from the exact same cells? We should be able to make a simpler wrapper for this scenario, not super proud with my current messy functions 😅

ChangqingW commented 1 year ago

The removing 14970 sampleMap rows not in names(experiments) message seem very concerning -- it is removing all your cells because the cell ids are not present in the sampleMap. Make sure you include samplemap and colData when creating your MultiAssayExperiment. Both colnames and primary would be just your colnames in this scenario.

sampleMap <- rbind(
  DataFrame(
    assay = "gene_counts", primary = colnames(scmixology_lib10_x),
    colname = colnames(scmixology_lib10_x)
  ),
  DataFrame(
    assay = "transcript_counts",
    primary = colnames(scmixology_lib10_transcripts), colname = colnames(scmixology_lib10_transcripts)
  )
)
colLib <- DataFrame(row.names = colnames(scmixology_lib10_x))
colLib$Lib_small <- T
m-noonan commented 1 year ago

So when I ran your example above it did a similar thing (removing 999 sampleMap rows, see below) and it didn't seem to affect the plotting.

sc_umap_expression(multiAssay = x, gene = "ENSG00000108107") Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 19 sampleMap rows with 'colname' not in colnames of experiments Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 999 sampleMap rows not in names(experiments) removing 19 colData rownames not in sampleMap 'primary' harmonizing input: removing 999 sampleMap rows not in names(experiments) removing 19 colData rownames not in sampleMap 'primary' harmonizing input: removing 980 sampleMap rows not in names(experiments) Constructing graphics... Plotting expression UMAPs ... Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata' 4: 'experiments' dropped; see 'metadata' 5: In (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : Ignoring unknown parameters: group.selfish

ChangqingW commented 1 year ago

Did you get your plots working?

m-noonan commented 1 year ago

Not yet, now I'm getting this error. I've tried a few other genes and it says the same thing for each one. What am I missing?

sc_umap_expression(multiAssay = combo, gene = "Emcn") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in sc_umap_expression(multiAssay = combo, gene = "Emcn") : Alternative isoform not found for the given gene: Emcn In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'

ChangqingW commented 1 year ago

I forgot to mention that the transcript count SCE need to have the following rowData columns:

m-noonan commented 1 year ago

Hi @ChangqingW, I added those rowData columns (hopefully I did it right...) but it's still giving me the same error and is removing rows. Here is my code below:

sce_gene <- SingleCellExperiment(gene_counts) sce_gene <- SingleCellExperiment(list(counts=gene_counts))

sce_tx <- SingleCellExperiment(tx_counts) sce_tx <- SingleCellExperiment(list(counts=tx_counts))

rowData(sce_tx)$transcript_id <- rownames(tx_counts) rowData(sce_tx)$FSM_match <- rownames(tx_counts)

library(rtracklayer) gtf <- rtracklayer::import("/home/users/mnoonan/refdata-gex-mm10-2020-A/genes/genes.gtf") gtf.df = as.data.frame(gtf) tx2gene = unique(gtf.df[ ,c("transcript_id","gene_name")])

head(tx2gene) transcript_id gene_name 1 Xkr4 2 ENSMUST00000162897 Xkr4 5 ENSMUST00000159265 Xkr4 8 ENSMUST00000070533 Xkr4 19 Gm1992 20 ENSMUST00000161581 Gm1992

df <- as.data.frame(transcript_id) tx2gene merge <- merge(df, tx2gene, by = "transcript_id")

dim(merge) [1] 48062 2 head(merge, 10) transcript_id gene_name 1 ENSMUST00000000001 Gnai3 2 ENSMUST00000000010 Hoxb9 3 ENSMUST00000000028 Cdc45 4 ENSMUST00000000033 Igf2 5 ENSMUST00000000049 Apoh 6 ENSMUST00000000058 Cav2 7 ENSMUST00000000080 Klf6 8 ENSMUST00000000087 Scmh1 9 ENSMUST00000000090 Cox5a 10 ENSMUST00000000095 Tbx2

rowData(sce_tx)$gene_id <- merge$gene_name

sce_tx class: SingleCellExperiment dim: 48062 14974 metadata(0): assays(1): counts rownames(48062): ENSMUST00000000001 ENSMUST00000000010 ... ENSMUST00000239167 ENSMUST00000239168 rowData names(3): transcript_id FSM_match gene_id colnames(14974): AAACCCAAGGGAGGGT AAACCCAAGGTGCAGT ... TTTGTTGTCAGGAAGC TTTGTTGTCGTGAGAG colData names(0): reducedDimNames(0): mainExpName: NULL altExpNames(0):

sampleMapreal <- rbind( data.frame( assay = "gene_counts", primary = colnames(sce_gene), colname = colnames(sce_gene) ), data.frame( assay = "transcript_counts", primary = colnames(sce_tx), colname = colnames(sce_tx) ) )

colLibnew <- data.frame(row.names = colnames(sce_gene)) colLibnew$Lib_small <- T

library(MultiAssayExperiment) combo <- MultiAssayExperiment::MultiAssayExperiment(list( gene_counts = sce_gene, transcript_counts = sce_tx ), sampleMap = sampleMapreal, colData = colLibnew)

rowData(experiments(combo)$transcript_counts)$gene_id <- rownames(experiments(combo)$transcript_counts)

sc_umap_expression(multiAssay = combo, gene = "Xkr4") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in sc_umap_expression(multiAssay = combo, gene = "Xkr4") : Alternative isoform not found for the given gene: Xkr4 In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'

ChangqingW commented 1 year ago

Could you try sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4]?

m-noonan commented 1 year ago

It doesn't work, something to do with the brackets. I did both below and it said "unexpected '[' in"

sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4] sc_umap_expression(multiAssay = combo, gene = [gene_id of Xkr4])

ChangqingW commented 1 year ago

Sorry, I meant the actual gene id of Xkr4, I think it is ENSG00000206579, so

sc_umap_expression(multiAssay = combo, gene = ENSG00000206579)
m-noonan commented 1 year ago

Getting closer! It's at least able to find the gene (this is Slc34a1, mouse) and found 3 isoforms but now running into a different error:

sc_umap_expression(multiAssay = combo, gene = "ENSMUSG00000021490") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14971 sampleMap rows not in names(experiments) removing 1 colData rownames not in sampleMap 'primary' Only 3 isoforms found for this gene.

Error in plot_isoforms$layers[[length(plot_isoforms$layers)]] : attempt to select less than one element in integerOneIndex In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata' 4: 'experiments' dropped; see 'metadata' 5: In max(xlim) : no non-missing arguments to max; returning -Inf

ChangqingW commented 1 year ago

This is MultiAssayExperiment removing all your cells because the cell names are not present in the primary column of the sampleMap.

ChangqingW commented 1 year ago

Sorry, I think this error is actually due to this line https://github.com/mritchielab/FLAMES/blob/44587e105a3ace42186fec0970a4b9ccd37f70b6/R/sc_annotate_plots.R#L289 which I use to hack ggbio's order of isoforms. Not sure why plot_isoforms$layers[[length(plot_isoforms$layers)]] is causing error for you though.

ChangqingW commented 1 year ago

Could be that rowRanges was not provided for your transcript count SCE. The rowRanges of the isoforms are passed to ggbio::autoplot to make the isoform annotations. You could add them by loading your GTF with rtracklayer::import and attaching them to the corresponding rows of your transcript count SCE (rowRanges(sce) <-). More details are on the Orchestrating Single-Cell Analysis with Bioconductor book chapter 4.2.4

m-noonan commented 1 year ago

Thanks @ChangqingW, that helped! But I'm missing ~200 gene ids somewhere. Do you know where this is referencing to?

sc_umap_expression(multiAssay = combo, gene = "ENSMUSG00000021490") harmonizing input: removing 4 sampleMap rows with 'colname' not in colnames of experiments Running PCA for experiments(multiAssay)$gene_counts ... harmonizing input: removing 3 sampleMap rows with 'colname' not in colnames of experiments removing 3 colData rownames not in sampleMap 'primary' Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 14970 sampleMap rows not in names(experiments) harmonizing input: removing 14970 sampleMap rows not in names(experiments) Error in dplyr::filter(): ℹ In argument: gene_id == gene. Caused by error: ! ..1 must be of size 48062 or 1, not size 47844. Run rlang::last_trace() to see where the error occurred. Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'

ChangqingW commented 1 year ago

I think we are almost there! You can identify the missing ones via is.na(rowData({your transcript SCE object})$gene_id), and if they are not relevant, you can throw them away via {your transcript SCE object} <- {your transcript SCE object}[!is.na(rowData({your transcript SCE object})$gene_id),]

m-noonan commented 1 year ago

Hi @ChangqingW, I tried using combine_sce of my gene and transcript SCE objects this way then did sc_umap_expression but got the "alternative isoform not found" error

combine <- combine_sce(short_read_large = multi_gene, short_read_small = multi_gene, long_read_sce = multi_tx) rowData(experiments(combine)$transcript_counts)$gene_id <- rownames(experiments(combine)$transcript_counts) sc_umap_expression(gene = "ENSMUSG00000024097", multiAssay = combine)

Running PCA for experiments(multiAssay)$gene_counts ... Running UMAP for experiments(multiAssay)$gene_counts ... harmonizing input: removing 5576 sampleMap rows not in names(experiments) removing 35 colData rownames not in sampleMap 'primary' harmonizing input: removing 5576 sampleMap rows not in names(experiments) removing 35 colData rownames not in sampleMap 'primary' Error in sc_umap_expression(gene = "ENSMUSG00000024097", multiAssay = combine) : Alternative isoform not found for the given gene: ENSMUSG00000024097 In addition: Warning messages: 1: useNames = NA is deprecated. Instead, specify either useNames = TRUE or useNames = TRUE. 2: 'experiments' dropped; see 'metadata' 3: 'experiments' dropped; see 'metadata'

If I share my counts matrices with you, could you try it on your end? Let me know, thanks!

m-noonan commented 1 year ago

Wait! I just ran it without the rowData(experiments(combine)$transcript_counts)$gene_id <- rownames(experiments(combine)$transcript_counts) line and it gave me a plot! It's just transcript info and the isoforms look weird but it's a start! Thoughts on how to get the gene expression and correct isoforms? plot_zoom_png

ChangqingW commented 1 year ago

Could you grep the relevant lines from your GTF / GFF file? It seems that the annotation does not contain exon information, or the isoforms are annotated as entire exons? The UMAP also seems weird -- the gray area are meant to denote cells not sampled for long-read sequencing in FLT-seq so this is probably a bug in the code, I am planning to make this function more protocol-agnostic but a bit busy at the moment.