pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
631 stars 168 forks source link

kallisto gene vs transcript count #434

Open Rajesh-HKU opened 2 months ago

Rajesh-HKU commented 2 months ago

Dear Laura,

There is a discrepancy between the gene count and transcript count from the count matrices generated by Kallisto. I generated the gene count matrix and tcc matrix from the same set of fastq files. Subsequently, I compared the gene count vs the total of transcript count. I used only the equivalence classes which map only to the one gene of interest. I found that these counts do not match. Surprisingly, even when the transcript count, based on the equivalence classes which only map to the gene of interest, is greater than zero, the gene count for the corresponding cells is zero.

The details are as follows:

Software versions: Kb_python 0.28.2 Kallisto 0.48

Count Matrix Commands: The code reads a list of fastq files and creates a count matrix of either genes or tcc associated with each specimen and saves the file with the specimen prefix.

Gene count matrix :
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t 2 --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE) my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_genes") file.copy(from = paste0("./counts_unfiltered/", my_files), to = paste0("./cellout/unfcounts/",specimens[2*i],"", my_files))

Tcc_count matrix:
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t2 --tcc --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE)
my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_tcc")
file.copy(from = paste0("./counts_unfiltered/", my_files), to=paste0("./cell_out/unfcountstcc/",specimendata$specimenID[i],"", my_files))

For the purpose of comparison of a particular gene vs tcc count, we only included the equivalent classes which had one gene mapping only.

read the gene count matrix and identify rows in the gene count matrix matching to BIN1 gene

lib <- lib <- specimendata$specimenID[i] count_file <- paste0(lib,"_cells_x_genes")
rowbin1 <- which(grepl(pattern = "^ENSG00000136717",rownames(res_matx))) #row 7729
res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE) totalgene <- res_mat[rowbin1_tcc,]

read the tcc count matrix, identify rows in the t2g file matching BIN1 and use only those rows for counting BIN1 transcripts

lib <- lib <- specimendata$specimenID[i] count_file <- paste0(lib,"_cells_x_tcc")
rowbin1_tcc <- which(grepl(pattern = "\tBIN1$",t2g$V1)) #rows 28581 to 28594 res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE) res_matx_tcc <- res_mat[rowbin1_tcc,]

find total of 14 BIN1 transcripts for each cell

totalbin1 <- colSums(as.matrix(res_matx_tcc))

compare res_matx and res_matx_tcc

table(totalgene,totalbin1)

      totalbin1 (sum of count of 14 transcripts)                          0    1   2 3 4 5 6

0 6877 933 151 29 15 1 0

1           1164 174 30 6 1 0 0

2           182 32 3 0 0 0 1

3          26 3 0 1 0 0 0

4         3 0 0 0 0 0 0

    Please note above that when total of transcripts is >=1, the gene count is still predominantly zero.

Thanks.

Best regards, Rajesh

Yenaled commented 2 months ago

I'm sorry, but I'm not going to look through many lines of unformatted code to identify why you're observing a discrepancy.

A few things to keep in mind: 1) Line numbers (and mtx row/col numbers) are 1-indexed while EC/transcript identifiers are 0-indexed. 2) Equivalence classes will NOT be the same between two different kallisto runs, leading to differences.

Rajesh-HKU commented 2 months ago

Dear Delaney, Thanks for the prompt feedback. Sorry for the unformatted code. I am new to this or any similar forum. Regarding your suggestions: 1) I have taken into account the 1-indexed vs 0-indexed difference 2) I have checked the equivalence classes manually as well. The problem seems to be that for my dataset the counts generated by kallisto when run with tcc= TRUE vs tcc= FALSE option seem to be different. Is this true in other cases you have encountered as well? How do I achieve better correspondence between these two runs?

Thanks.

Best regards, Rajesh

Yenaled commented 2 months ago

Yes, I've been able to recapitulate my desired counts between TCCs and genes on my end.

If you want, you can run the bustools commands on your BUS file, and manually run bustools count to get TCC matrices and gene count matrices from the same BUS file. (You can even download bustools from source and edit the bustools_count.cpp C++ file to manually output things).

lauraluebbert commented 2 months ago

Just linking the original issue here for reference https://github.com/pachterlab/gget_examples/issues/6