pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
647 stars 170 forks source link

Getting isoform information from 10x 3' data with the --tcc option #444

Closed BKover99 closed 2 months ago

BKover99 commented 2 months ago

Hi, I’ve got a couple of technical and conceptual questions about using kallisto with the TCC option. For context, I’ve recently read the Commons Cell Atlas paper (https://www.biorxiv.org/content/10.1101/2024.03.23.586412v1.full) and thought I would explore isoforms in my tissue of interest.

    • conceptual From having read multiple of the issues on this Github page, and the above mentioned publication, it remains unclear to me whether it is a good idea to try to get isoform information using 10X 3’ data. From what I understand there is still some information to be extracted here, e.g. as the CCA paper mentions, about a quarter of genes have most their reads in single equivalence classes. However, from multiple issues on this Github page and from the kallisto-bustools paper I get the opposite feeling, that it is generally discouraged to try to obtain TCC information from 10X data. Am I missing something here?
    • conceptual Would the TCC matrix be sufficient to explore isoform differences between cell types, or is it generally better to go ahead with the EM step and try to obtain abundance estimates? (Or am I right in that there is really no such thing as “best practices” with this?)

kb ref -d mouse -i index_std.idx -g t2g_std.txt kb count -i index_std.idx -g t2g_std.txt -x {tech} --tcc --h5ad -o {output_dir} {' '.join(all_fastqs)}

This gave me some of the anticipated output.

Screenshot 2024-06-25 at 10 02 50

I thought that beyond the TCC matrix, this would give me the abundance matrix (after having performed EM). From this issue (https://github.com/pachterlab/kallisto/issues/423) I thought that the —tcc option should automatically also invoke the EM algorithm as the last step. However I did not find the anticipated "/quant_unfiltered/" folder with the output abundance matrix. Is there something wrong with my command?

    • practical Either way, I then decided to run kallisto quant-tcc to get the abundance matrix that way.

!kallisto quant-tcc -i index_std.idx -g t2g_std.txt -b 0 -e /content/drive/MyDrive/atlas/processed/tcc/SRX8489818/counts_unfiltered/cells_x_tcc.ec.txt -o {output_dir} /content/drive/MyDrive/atlas/processed/tcc/SRX8489818/counts_unfiltered/cells_x_tcc.mtx -t 8

This then appeared extremely slow, also continously printing the following message:

“[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts.[ em] number of priors does not match number of transcripts. defaulting to uniform priors.

    defaulting to uniform priors.

[ em] number of priors does not match number of transcripts. defaulting to uniform priors. [ em] number of priors does not match number of transcripts. defaulting to uniform priors. [quant] Processing sample/cell [quant] Processing sample/cell [quant] Processing sample/cell 71640

71641 [quant] Processing sample/cell 71642t] Processing sample/cell [quant]”

Even after a couple hours it was at about 70000 of the 200k barcodes/cells/samples. I wonder whether my command was wrong/incorrect (especially because of the warning message) or if it really takes this long to perform the EM step. In the latter case, is there anything to speed it up even a little bit?

Thanks in advance for the response! Found the pre-print and the previous discussions here very useful.

Yenaled commented 2 months ago
  1. TCC information is distinct from isoform information. Getting TCC information is valuable as many previous papers from the Pachter lab have shown (Ntranos et al., 2016, Genome Biology; Ntranos et al., 2019, Nat Methods).
  2. You cannot perform EM without TCC matrix (the EM uses TCCs).
  3. The quant folder is only produced for bulk RNAseq
  4. This is why the quant folder is only for bulk RNAseq. The quant-tcc step takes a very long time (and the identifiability problem w.r.t. resolving isoform information from 3’ end data makes me believe that the isoform-level estimates may not be that reliable, hence is disabled). No, the EM cannot be sped up aside from trying more threads (EM only takes a few seconds; but when you multiply that by a couple of thousand, yeah, the time adds up). You can try filtering only for cells that have a sufficient number of UMIs, and write some custom scripts to subset your TCC matrix to only contain those cells; then you might have only a couple thousand of cells rather than hundreds of thousands of cells to run EM on. (P.S. the warning message you see is disabled in the forthcoming release of the new version of kallisto; it is a meaningless message).
BKover99 commented 2 months ago

Hi @Yenaled really appreciate the quick response. Just a couple follow-up questions to make sure I understand the individual points.

  1. is TCC information completely independent from isoform information? My understanding was that if there is a difference in TCCs between two samples (e.g. through logistic regression like in the Ntranos paper), then it would suggest that it is due to differential contributions of isoforms.

  2. Yes of course that’s clear. I think my question here just connects to 1., namely that I am unsure whether TCCs are sufficient to understand differential isoform contribution or whether abundances might be more informative.

  3. Okay, that makes sense.

  4. Great, thanks for the suggestions. Initially, I was thinking of doing something along the lines of what’s been done in the CCA paper:

    “A transcript Compatibility Counts (TCC) matrix for each sample was obtained by running ‘bustools count’ without the ‘--genecount’ option on the bus files generated after pseudoaligning the raw reads. Transcript abundances were quantified using the EM algorithm by running ‘kallisto quant-tcc’ on the TCC matrices. The transcript abundance matrix of each sample was normalized within each cell-type using log1pPF. The normalized matrix was then subsetted to isoforms that i) derived from genes with more than one isoform, ii) had reads in the samples that mapped uniquely to it and iii) had a minimum average normalized expression of 0.002 per cell.”

Do you happen to know if they also just subsetted cells prior to the EM step, or just waited out processing all barcodes?

If you could just briefly address these. Otherwise, happy to close the issue. Thanks for the help!

Yenaled commented 2 months ago
  1. Isoform information is derived from EM. EM is probabilistic assignment going from TCCs to isoforms. Thus TCCs != isoforms. Isoform uncertainty exists; TCC uncertainty does not.

I was not involved at all in that CCA paper so I can’t tell you what was done. But, regardless, if you want speed, do some barcode filtering.