bio-info-guy closed this issue 9 months ago
You can't get TCCs that way. You want to get nascent, ambiguous, mature matrices -- but you can't do so with TCCs (how do you assign an equivalence class as being "mature", "ambiguous", or "nascent"?). As for TPMs, what exactly does "effective length" mean when including unspliced transcripts?
This is why it was intentionally left out (although we'll probably insert an informative error message in a later release).
Thank you for the response. I understand the problem with defining an effective length for unspliced transcripts. I am still a bit confused about why an equivalence class cannot be assigned as "mature", "ambiguous", or "nascent", because the program still seems to produce the corresponding nascent, ambiguous, and mature cells_x_tcc.mtx matrices. The matrices all have dimensions cells x number_of_ec, and the proportions of nascent vs (mature + ambiguous) are similar to those obtained with the --tcc option off.
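For reference, the comparison described above (the share of counts in the nascent matrix relative to all three matrices) can be sketched as follows. This is a minimal illustration and not part of kb-python: the helper names `mtx_total` and `nascent_fraction` are hypothetical, the tiny in-memory matrices stand in for the real output files, and the parser assumes the Matrix Market coordinate format.

```python
import io


def mtx_total(handle):
    """Sum all values in a Matrix Market coordinate-format stream."""
    total = 0.0
    header_seen = False
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and banner/comment lines
        if not header_seen:
            header_seen = True  # first data line is "rows cols nnz"
            continue
        _row, _col, value = line.split()
        total += float(value)
    return total


def nascent_fraction(nascent, mature, ambiguous):
    """Fraction of all counts that fall in the nascent matrix."""
    n, m, a = (mtx_total(h) for h in (nascent, mature, ambiguous))
    return n / (n + m + a)


# Hypothetical toy matrices (2 cells x 3 equivalence classes).
BANNER = "%%MatrixMarket matrix coordinate real general\n"
nascent = io.StringIO(BANNER + "2 3 2\n1 1 3\n2 3 1\n")
mature = io.StringIO(BANNER + "2 3 2\n1 2 10\n2 2 4\n")
ambiguous = io.StringIO(BANNER + "2 3 1\n1 3 2\n")

fraction = nascent_fraction(nascent, mature, ambiguous)
print(f"nascent fraction: {fraction:.2f}")
```

A similar fraction with and without --tcc only shows that the totals agree; it says nothing about whether the per-class values are meaningful.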
I would not use --tcc with that workflow, period. It has not been benchmarked, validated, or used on any analysis. You can always get numbers out but that doesn’t mean those numbers are actually useful. What does an “ambiguous” TCC actually tell you and what can you actually gain from it? You can run kallisto’s EM algorithm on it and you’ll get numbers, but what exactly do those numbers mean? What are the implications of multimapping (between genes, between transcripts, and between splicing statuses)?
Perhaps these are open research questions, but addressing them is not kb-python’s purpose. You can work around kb-python to tackle them, but it’s not something I can provide user support for.
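The remark that an EM algorithm will always return numbers can be illustrated with a toy sketch. This is not kallisto's implementation: `toy_em` is a hypothetical helper, the counts are invented, and it assumes equal effective lengths for all transcripts, which is exactly the assumption that is questionable for unspliced transcripts.

```python
def toy_em(ec_counts, n_transcripts, n_iter=200):
    """Toy EM over equivalence-class counts, assuming equal effective lengths.

    ec_counts maps a tuple of transcript indices (one equivalence class)
    to the number of reads/UMIs assigned to that class.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting abundances
    total = sum(ec_counts.values())
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        # E-step: split each class count across its members by current abundance
        for members, count in ec_counts.items():
            weight = sum(theta[t] for t in members)
            if weight == 0.0:
                continue
            for t in members:
                assigned[t] += count * theta[t] / weight
        # M-step: renormalise the assigned counts into abundances
        theta = [x / total for x in assigned]
    return [p * total for p in theta]


# Hypothetical example: transcript 0 is spliced, transcript 1 is unspliced,
# and 20 counts are compatible with both.
ec_counts = {(0,): 30, (1,): 10, (0, 1): 20}
estimates = toy_em(ec_counts, n_transcripts=2)
print([round(x, 2) for x in estimates])
```

The procedure returns an estimate for every transcript whatever the input is, so getting output is not evidence that the split of the shared counts is biologically meaningful.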
Thank you for the explanations. I will close this issue now.
I have been trying to run the nac workflow with the --tcc option enabled, in the hope of getting transcript-level estimates for both nascent and mature transcripts.
The exact command that was run:
Command output (with --verbose flag)
As a side note, I was wondering whether this is a viable way to get transcript-level count estimates for unspliced transcripts, since the unspliced transcript is now essentially the entire gene (including introns). Also, looking in count.py, I noticed that there doesn't seem to be a TPM-level quantification output in the nac workflow. I could be going about this in the wrong way, and perhaps estimating transcript-level counts for unspliced transcripts is not viable, which would render TPM quantification futile anyway.