Closed BKover99 closed 2 months ago
Hi @Yenaled really appreciate the quick response. Just a couple follow-up questions to make sure I understand the individual points.
1. Is TCC information completely independent from isoform information? My understanding was that if there is a difference in TCCs between two samples (e.g. through logistic regression, as in the Ntranos paper), then it would suggest that it is due to differential contributions of isoforms.
2. Yes, of course, that’s clear. I think my question here just connects to 1., namely that I am unsure whether TCCs are sufficient to understand differential isoform contribution or whether abundances might be more informative.
3. Okay, that makes sense.
4. Great, thanks for the suggestions. Initially, I was thinking of doing something along the lines of what’s been done in the CCA paper:
“A transcript Compatibility Counts (TCC) matrix for each sample was obtained by running ‘bustools count’ without the ‘--genecount’ option on the bus files generated after pseudoaligning the raw reads. Transcript abundances were quantified using the EM algorithm by running ‘kallisto quant-tcc’ on the TCC matrices. The transcript abundance matrix of each sample was normalized within each cell-type using log1pPF. The normalized matrix was then subsetted to isoforms that i) derived from genes with more than one isoform, ii) had reads in the samples that mapped uniquely to it and iii) had a minimum average normalized expression of 0.002 per cell.”
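The normalization step named in the quoted passage can be sketched as follows. This is only an illustration, assuming log1pPF means proportional fitting (scaling every cell to the mean sequencing depth) followed by log1p; the function name `log1p_pf` and the toy matrix are hypothetical, and applying it separately per cell type is left to the caller.

```python
import math

def log1p_pf(counts):
    """Scale each cell (row) to the mean depth, then apply log1p.

    counts: list of rows, one row per cell, holding non-negative abundances.
    Assumption: all rows belong to the same group (e.g. one cell type).
    """
    depths = [sum(row) for row in counts]
    mean_depth = sum(depths) / len(depths)
    normalized = []
    for row, depth in zip(counts, depths):
        # Cells with zero depth stay all-zero instead of dividing by zero.
        scale = mean_depth / depth if depth > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized

# Toy example: two cells with the same composition but 10x different depth.
toy = [[1, 3], [10, 30]]
normalized = log1p_pf(toy)
```

After proportional fitting both toy cells have the same depth, so their normalized rows agree up to floating-point error.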
Do you happen to know whether they subsetted cells prior to the EM step, or simply let it run on all barcodes?
It would be great if you could briefly address these; otherwise, I’m happy to close the issue. Thanks for the help!
I was not involved at all in that CCA paper so I can’t tell you what was done. But, regardless, if you want speed, do some barcode filtering.
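One way such barcode filtering could look is sketched below. It is a minimal illustration rather than an official workflow: it assumes the TCC matrix is a Matrix Market coordinate file with barcodes as rows, and the function name `filter_mtx_rows` and the threshold are hypothetical. The matching barcodes file would need the same rows kept, in the same order.

```python
from collections import defaultdict

def filter_mtx_rows(lines, min_total):
    """Keep only rows (barcodes) whose summed counts reach min_total.

    lines: lines of a Matrix Market coordinate file (header, size line, entries).
    Returns (new_lines, kept_rows); kept_rows holds the original 1-based row
    indices in their new order, for subsetting the barcodes file.
    """
    header = [ln for ln in lines if ln.startswith("%")]
    body = [ln.split() for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, _ = (int(x) for x in body[0])
    entries = [(int(r), int(c), float(v)) for r, c, v in body[1:]]

    # Total count per barcode.
    totals = defaultdict(float)
    for r, _, v in entries:
        totals[r] += v

    kept_rows = [r for r in range(1, n_rows + 1) if totals[r] >= min_total]
    new_index = {r: i for i, r in enumerate(kept_rows, start=1)}

    # Renumber surviving rows so the output is a valid, smaller matrix.
    kept = [(new_index[r], c, v) for r, c, v in entries if r in new_index]
    out = header + [f"{len(kept_rows)} {n_cols} {len(kept)}"]
    out += [f"{r} {c} {v:g}" for r, c, v in kept]
    return out, kept_rows

# Toy matrix: 3 barcodes x 4 equivalence classes; only barcode 2 has >= 5 counts.
demo = [
    "%%MatrixMarket matrix coordinate real general",
    "3 4 5",
    "1 1 1",
    "1 3 1",
    "2 2 10",
    "2 4 5",
    "3 1 2",
]
filtered, kept_rows = filter_mtx_rows(demo, min_total=5)
```

Running the EM step on the filtered matrix instead of the full one means fewer per-barcode problems to solve, which is where the speedup would come from.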
Hi, I’ve got a couple of technical and conceptual questions about using kallisto with the TCC option. For context, I’ve recently read the Commons Cell Atlas paper (https://www.biorxiv.org/content/10.1101/2024.03.23.586412v1.full) and thought I would explore isoforms in my tissue of interest.
kb ref -d mouse -i index_std.idx -g t2g_std.txt
kb count -i index_std.idx -g t2g_std.txt -x {tech} --tcc --h5ad -o {output_dir} {' '.join(all_fastqs)}
This gave me some of the anticipated output.
I thought that, beyond the TCC matrix, this would also give me the abundance matrix (after having performed EM). From this issue (https://github.com/pachterlab/kallisto/issues/423) I understood that the --tcc option should automatically invoke the EM algorithm as the last step. However, I did not find the anticipated "/quant_unfiltered/" folder with the output abundance matrix. Is there something wrong with my command?
!kallisto quant-tcc -i index_std.idx -g t2g_std.txt -b 0 -e /content/drive/MyDrive/atlas/processed/tcc/SRX8489818/counts_unfiltered/cells_x_tcc.ec.txt -o {output_dir} /content/drive/MyDrive/atlas/processed/tcc/SRX8489818/counts_unfiltered/cells_x_tcc.mtx -t 8
This ran extremely slowly and continuously printed the following messages:
“[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts. defaulting to uniform priors.
[ em] number of priors does not match number of transcripts. defaulting to uniform priors. [ em] number of priors does not match number of transcripts. defaulting to uniform priors. [quant] Processing sample/cell [quant] Processing sample/cell [quant] Processing sample/cell 71640
71641 [quant] Processing sample/cell 71642t] Processing sample/cell [quant]”
Even after a couple of hours it had only reached about 70,000 of the 200k barcodes. I wonder whether my command was incorrect (especially given the warning message) or whether the EM step really takes this long. In the latter case, is there anything I can do to speed it up even a little?
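For intuition about what the EM step computes for each barcode, here is a toy sketch. It is not kallisto's implementation: it assumes uniform starting abundances and ignores transcript lengths, and the function name `em_abundances` and the example equivalence classes are hypothetical.

```python
def em_abundances(ec_to_transcripts, ec_counts, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from one cell's TCCs.

    ec_to_transcripts: dict mapping equivalence class id -> list of transcript ids.
    ec_counts: dict mapping equivalence class id -> count in this cell.
    """
    alpha = [1.0 / n_transcripts] * n_transcripts  # uniform start
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        for ec, count in ec_counts.items():
            transcripts = ec_to_transcripts[ec]
            denom = sum(alpha[t] for t in transcripts)
            if denom == 0:
                continue
            # E-step: split the class count in proportion to current abundances.
            for t in transcripts:
                assigned[t] += count * alpha[t] / denom
        total = sum(assigned)
        if total == 0:
            break
        # M-step: renormalize assigned counts into abundances.
        alpha = [x / total for x in assigned]
    return alpha

# Toy cell: 30 reads unique to transcript 0, 10 unique to transcript 1,
# and 60 reads compatible with both.
ecs = {0: [0], 1: [1], 2: [0, 1]}
counts = {0: 30, 1: 10, 2: 60}
abundances = em_abundances(ecs, counts, n_transcripts=2)
```

In this toy case the ambiguous reads end up split 3:1, following the unique reads. Since a loop like this runs once per barcode, the total cost grows with the number of barcodes kept in the matrix.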
Thanks in advance for the response! Found the pre-print and the previous discussions here very useful.