pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
144 stars 23 forks source link

kallisto not recapitulating cellranger/STAR-based results #104

Closed vinay-swamy closed 3 years ago

vinay-swamy commented 3 years ago

Describe the issue Hi, This issue is more specifically about bustools.(if this is wrong repo for this, happy to move the issue to another, this one seems the most active) I've been using kallisto bustools for a project, and have been gettting some inconsistencies in the quantification output from bustools vs the quantification output from cellranger. I'm re-analyzing the data from this study, which is an analysis of amacrine cells from the mouse retina. The study is composed of 10xv2 scRNA-seq from 10 mice. The data I show below is from one of these mice(SRX8241388/MouseACS6), but is reflected in all samples from this study. [Link to bam file] (https://sra-download.ncbi.nlm.nih.gov/traces/sra43/SRZ/011680/SRR11680523/MouseACS6.bam ). All code used for quantification and analysis is available at this github repo.

We first noticed these inconsitencies when looking at the gene expression of the gene Arr3. Arr3 is well known to be specifcally expressed in cone cells (another retinal cell type), and we expect minimal expression of Arr3 in Amacrine cells, which is what we see in data quantified via cellranger. Both sets of data were analyzed using the same seurat pipeline (see linked repo )

cellranger_plot

However, when looking at kallisto derived quantification, we see significantly more Arr3 expression in amacrine cells.

kallisto_plot

This inconsistency is further reflected in low overlap highly variable genes of the two sets( 850/5000) and the somewhat low the median pearson correlation(.80) calculated using common genes and cells between the two. I'm curious if I've made a mistake in the kallisto pipeline, or if a problem like this has been identified before, and would appreciate any feedback on how to proceed

What is the exact command that was run?

kallisto bus -t 8 -x 10xv2 ...
bustools sort  ...
bustools correct ...
bustools count

Down stream analysis to generate the figs https://github.com/vinay-swamy/cellranger-vs-kallisto-SRP259930/blob/master/cellranger_vs_kallisto.Rmd

davemcg commented 3 years ago

Here are the two seurat objs (cellranger_seu and kallisto_seu), if that is helpful:

https://hpc.nih.gov/~mcgaugheyd/sanesAC_kb/seurat_objs.Rdata

We obviously can provide any files [up|down]stream to help

davemcg commented 3 years ago

The study's Fig 1 provides a handy retinal celltype <-> marker heatmap. You can see how this works for the cellranger output, but seems scrambled with Kallisto.

https://www.jneurosci.org/content/40/27/5177/tab-figures-data

Screen Shot 2021-03-04 at 10 47 15 AM Screen Shot 2021-03-04 at 10 49 43 AM Screen Shot 2021-03-04 at 10 50 44 AM
davemcg commented 3 years ago

Other observations (not certain if I'm looking at the wrong things):

Arr3 expression:

But the distribution of Arr3 is much "wider" in kallisto relative to cellranger

bind_rows(cellranger_seu@assays$RNA@counts['Arr3',] %>% enframe() %>% mutate(tech = 'CellRanger'), kallisto_seu@assays$RNA@counts['Arr3',] %>% enframe() %>% mutate(tech = 'kallisto')) %>% ggplot(aes(x=value,color=tech)) + geom_density()

plot_zoom_png

Looking at a high Arr3 cell in CellRanger is odd because Kallisto detects 0 counts in the same cell (sorry I borked the cell names)

> cellranger_seu@assays$RNA@counts['Arr3','MouseACS6_AATCGGTGTAAATGTG-1']
[1] 16
> kallisto_seu@assays$RNA@counts['Arr3','MouseACS6_MouseACS6_AATCGGTGTAAATGTG-1-1']
[1] 0
> cellranger_seu@assays$RNA@counts['Pax6','MouseACS6_AATCGGTGTAAATGTG-1']
[1] 4
> kallisto_seu@assays$RNA@counts['Pax6','MouseACS6_MouseACS6_AATCGGTGTAAATGTG-1-1']
[1] 4

We can also see the opposite where kallisto detects several Arr3 counts, but cellranger detects 0:


> kallisto_seu@assays$RNA@counts['Arr3','MouseACS6_MouseACS6_ACTATCTAGTCGTACT-1-1']
[1] 7
> cellranger_seu@assays$RNA@counts['Arr3','MouseACS6_ACTATCTAGTCGTACT-1']
[1] 0
davemcg commented 3 years ago

To rule out whether we messed up the idx creation I re-ran Vinny's work with your precomputed Mouse idx (tldr no big change)

(you can download the seurat file here: https://hpc.nih.gov/~mcgaugheyd/sanesAC_kb/kallisto_seu__pachterIDX.Rdata)

kb --help #version 0.25.1
# if it helps we could provide the fastq files
# 10x tool for creating fq pairs from a 10x bam
./bamtofastq-1.3.2 --nthreads 8 MouseACS6.bam fastq_MouseACS6 
# download pachter group computed idx
kb ref -d mouse -i index.idx -g t2g.txt
kb count -t 12 -i index.idx -g t2g.txt -x 10xv2 -o kb_standard --workflow standard --filter bustools \
fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz

R

library(BUSpaRse)
library(tidyverse)
library(Seurat)
run_seurat_pipeline <- function(mat, md){
  seu <- CreateSeuratObject(mat, meta.data = md)
  seu <- seu[,seu$nFeature_RNA > 200]
  seu <- NormalizeData(seu)
  seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 5000)
  seu <- ScaleData(seu)
  seu <- RunPCA(seu, npcs = 50,features = VariableFeatures(object = seu))
  keep_pcs = 1:20
  seu <- RunUMAP(seu, dims = keep_pcs)
  return(seu)
}
kallisto_data <- read_count_output('kb_standard/counts_filtered/', 'cells_x_genes')
## import sanes metadata
sanes_amacrine_labels <- read_csv('sanes_amacrine_subcluster_labels.csv', skip = 2, col_names = c('Barcode', 'sanes_subcluster'))
## convert gene ids to gene names 
t2g <- read_tsv('t2g.txt', col_names = FALSE) %>% data.frame()
t2g_filter <- t2g %>% select(X2, X3) %>% unique()
rownames(t2g_filter) <- t2g_filter$X2
rownames(kallisto_data) <- t2g_filter[rownames(kallisto_data), 'X3']
mouse <- "MouseACS6"
colnames(kallisto_data) <-  make.unique(paste0(mouse, '_', colnames(kallisto_data), '-1'))
ksto_msg_cells <- colnames(kallisto_data)[!colnames(kallisto_data) %in% sanes_amacrine_labels$Barcode]
kallisto_metadata <- bind_rows(sanes_amacrine_labels,
                                tibble(Barcode = ksto_msg_cells, sanes_subcluster = 'NonAmacrine')
                                ) %>% as.data.frame %>% 
  mutate(is_amacrine = ifelse(sanes_subcluster == 'NonAmacrine', 'NonAmacrine', 'Amacrine'))
rownames(kallisto_metadata) <- kallisto_metadata$Barcode
kallisto_seu__pachterIDX <- run_seurat_pipeline(kallisto_data,kallisto_metadata)
FeaturePlot(kallisto_seu__pachterIDX, features = c('Arr3', 'Pax6', 'Gfap', 'Vsx2')) + plot_annotation("kallisto pachter idx")
Screen Shot 2021-03-04 at 2 22 09 PM
davemcg commented 3 years ago

One thing that caught my eye was that half of the fastq files end in 002. I've never seen that before. I double checked the Illumina naming scheme (https://help.basespace.illumina.com/articles/descriptive/fastq-files/) and it says "001—The last segment is always 001."

So...weird. I'm re-running kb count with just the 001 file pairs to see if that changes the output.

davemcg commented 3 years ago

Yes. This is the way (I still don't know what is going on though).

*001.fastq.gz

Screen Shot 2021-03-04 at 3 54 25 PM

*002.fastq.gz

Screen Shot 2021-03-04 at 3 54 54 PM

This is weird:

davemcg commented 3 years ago

Were multiple experiments concatenated together? And the overlapping barcodes resulted in in silico gene<->cell mixing?

davemcg commented 3 years ago

While this good news for you this is bad news for us (or maybe just Vinny) because our process for ingesting 10x bams involved just collapsing all the fastq into one set of files...

davemcg commented 3 years ago

Related: https://github.com/10XGenomics/bamtofastq/issues/23

davemcg commented 3 years ago

Just to cap this, I re-ran ./bamtofastq-1.3.2 --nthreads 8 --reads-per-fastq=99999999999999999 MouseACS6.bam fastq_MouseACS6 (note the added --reads-per-fastq=99999999999999999).

The result is as expected.

Screen Shot 2021-03-04 at 8 48 52 PM

I'm still a bit miffed as to why...but in practice this resolves the issue.

sbooeshaghi commented 3 years ago

Awesome, I'm glad this is resolved. And just to clarify for anyone else encountering this thread, the issue is a bug in bamtofastq. It appears that when generating FASTQ records from a BAM file the barcodes and reads get scrambled (see the 10xgenomics issue above).

davemcg commented 3 years ago

Yes thanks Sina. Apologies for bringing kallisto into this mess. FWIW during this mess we got alevin running ... and it had a similar result.