Closed vinay-swamy closed 3 years ago
Here are the two seurat objs (cellranger_seu
and kallisto_seu
), if that is helpful:
https://hpc.nih.gov/~mcgaugheyd/sanesAC_kb/seurat_objs.Rdata
We obviously can provide any files [up|down]stream to help
The study's Fig 1 provides a handy retinal celltype <-> marker heatmap. You can see how this works for the cellranger output, but seems scrambled with Kallisto.
https://www.jneurosci.org/content/40/27/5177/tab-figures-data
Other observations (not certain if I'm looking at the wrong things):
Arr3 expression:
But the distribution of Arr3 is much "wider" in kallisto relative to cellranger
bind_rows(cellranger_seu@assays$RNA@counts['Arr3',] %>% enframe() %>% mutate(tech = 'CellRanger'), kallisto_seu@assays$RNA@counts['Arr3',] %>% enframe() %>% mutate(tech = 'kallisto')) %>% ggplot(aes(x=value,color=tech)) + geom_density()
Looking at a high Arr3 cell in CellRanger is odd because Kallisto detects 0 counts in the same cell (sorry I borked the cell names)
> cellranger_seu@assays$RNA@counts['Arr3','MouseACS6_AATCGGTGTAAATGTG-1']
[1] 16
> kallisto_seu@assays$RNA@counts['Arr3','MouseACS6_MouseACS6_AATCGGTGTAAATGTG-1-1']
[1] 0
> cellranger_seu@assays$RNA@counts['Pax6','MouseACS6_AATCGGTGTAAATGTG-1']
[1] 4
> kallisto_seu@assays$RNA@counts['Pax6','MouseACS6_MouseACS6_AATCGGTGTAAATGTG-1-1']
[1] 4
We can also see the opposite where kallisto detects several Arr3 counts, but cellranger detects 0:
> kallisto_seu@assays$RNA@counts['Arr3','MouseACS6_MouseACS6_ACTATCTAGTCGTACT-1-1']
[1] 7
> cellranger_seu@assays$RNA@counts['Arr3','MouseACS6_ACTATCTAGTCGTACT-1']
[1] 0
To rule out whether we messed up the idx creation I re-ran Vinny's work with your precomputed Mouse idx (tldr no big change)
(you can download the seurat file here: https://hpc.nih.gov/~mcgaugheyd/sanesAC_kb/kallisto_seu__pachterIDX.Rdata)
kb --help #version 0.25.1
# if it helps we could provide the fastq files
# 10x tool for creating fq pairs from a 10x bam
./bamtofastq-1.3.2 --nthreads 8 MouseACS6.bam fastq_MouseACS6
# download pachter group computed idx
kb ref -d mouse -i index.idx -g t2g.txt
kb count -t 12 -i index.idx -g t2g.txt -x 10xv2 -o kb_standard --workflow standard --filter bustools \
fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz
R
library(BUSpaRse)
library(tidyverse)
library(Seurat)
run_seurat_pipeline <- function(mat, md){
seu <- CreateSeuratObject(mat, meta.data = md)
seu <- seu[,seu$nFeature_RNA > 200]
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 5000)
seu <- ScaleData(seu)
seu <- RunPCA(seu, npcs = 50,features = VariableFeatures(object = seu))
keep_pcs = 1:20
seu <- RunUMAP(seu, dims = keep_pcs)
return(seu)
}
kallisto_data <- read_count_output('kb_standard/counts_filtered/', 'cells_x_genes')
## import sanes metadata
sanes_amacrine_labels <- read_csv('sanes_amacrine_subcluster_labels.csv', skip = 2, col_names = c('Barcode', 'sanes_subcluster'))
## convert gene ids to gene names
t2g <- read_tsv('t2g.txt', col_names = FALSE) %>% data.frame()
t2g_filter <- t2g %>% select(X2, X3) %>% unique()
rownames(t2g_filter) <- t2g_filter$X2
rownames(kallisto_data) <- t2g_filter[rownames(kallisto_data), 'X3']
mouse <- "MouseACS6"
colnames(kallisto_data) <- make.unique(paste0(mouse, '_', colnames(kallisto_data), '-1'))
ksto_msg_cells <- colnames(kallisto_data)[!colnames(kallisto_data) %in% sanes_amacrine_labels$Barcode]
kallisto_metadata <- bind_rows(sanes_amacrine_labels,
tibble(Barcode = ksto_msg_cells, sanes_subcluster = 'NonAmacrine')
) %>% as.data.frame %>%
mutate(is_amacrine = ifelse(sanes_subcluster == 'NonAmacrine', 'NonAmacrine', 'Amacrine'))
rownames(kallisto_metadata) <- kallisto_metadata$Barcode
kallisto_seu__pachterIDX <- run_seurat_pipeline(kallisto_data,kallisto_metadata)
FeaturePlot(kallisto_seu__pachterIDX, features = c('Arr3', 'Pax6', 'Gfap', 'Vsx2')) + plot_annotation("kallisto pachter idx")
One thing that caught my eye was that half of the fastq files end in 002. I've never seen that before. I double checked the Illumina naming scheme (https://help.basespace.illumina.com/articles/descriptive/fastq-files/) and it says "001—The last segment is always 001."
So...weird. I'm re-running kb count
with just the 001 file pairs to see if that changes the output.
Yes. This is the way (I still don't know what is going on though).
*001.fastq.gz
*002.fastq.gz
This is weird:
Were multiple experiments concatenated together? And the overlapping barcodes resulted in in silico gene<->cell mixing?
While this good news for you this is bad news for us (or maybe just Vinny) because our process for ingesting 10x bams involved just collapsing all the fastq into one set of files...
Just to cap this, I re-ran ./bamtofastq-1.3.2 --nthreads 8 --reads-per-fastq=99999999999999999 MouseACS6.bam fastq_MouseACS6
(note the added --reads-per-fastq=99999999999999999
).
The result is as expected.
I'm still a bit miffed as to why...but in practice this resolves the issue.
Awesome, I'm glad this is resolved. And just to clarify for anyone else encountering this thread, the issue is a bug in bamtofastq
. It appears that when generating FASTQ records from a BAM file the barcodes and reads get scrambled (see the 10xgenomics issue above).
Yes thanks Sina. Apologies for bringing kallisto
into this mess. FWIW during this mess we got alevin
running ... and it had a similar result.
Describe the issue Hi, This issue is more specifically about bustools.(if this is wrong repo for this, happy to move the issue to another, this one seems the most active) I've been using kallisto bustools for a project, and have been gettting some inconsistencies in the quantification output from bustools vs the quantification output from cellranger. I'm re-analyzing the data from this study, which is an analysis of amacrine cells from the mouse retina. The study is composed of 10xv2 scRNA-seq from 10 mice. The data I show below is from one of these mice(SRX8241388/MouseACS6), but is reflected in all samples from this study. [Link to bam file] (https://sra-download.ncbi.nlm.nih.gov/traces/sra43/SRZ/011680/SRR11680523/MouseACS6.bam ). All code used for quantification and analysis is available at this github repo.
We first noticed these inconsitencies when looking at the gene expression of the gene Arr3. Arr3 is well known to be specifcally expressed in cone cells (another retinal cell type), and we expect minimal expression of Arr3 in Amacrine cells, which is what we see in data quantified via cellranger. Both sets of data were analyzed using the same seurat pipeline (see linked repo )
However, when looking at kallisto derived quantification, we see significantly more Arr3 expression in amacrine cells.
This inconsistency is further reflected in low overlap highly variable genes of the two sets( 850/5000) and the somewhat low the median pearson correlation(.80) calculated using common genes and cells between the two. I'm curious if I've made a mistake in the kallisto pipeline, or if a problem like this has been identified before, and would appreciate any feedback on how to proceed
What is the exact command that was run?
Down stream analysis to generate the figs https://github.com/vinay-swamy/cellranger-vs-kallisto-SRP259930/blob/master/cellranger_vs_kallisto.Rmd