Error when Running nac with tcc option

bio-info-guy commented 9 months ago

I have been trying to run the nac workflow with the --tcc option enabled with the hopes of getting transcript level estimates for both nascent and mature transcripts.

The exact command that was run:

kb count --overwrite --tcc  --workflow nac -x SMARTSEQ2  -t 10 -m 64G -i ./mouse_spike.idx -g ./mouse_spike.t2g.txt  -c2 ./mouse_spike.ic.txt -c1 ./mouse_spike.tc.txt --parity paired --loom samples.txt

Command output (with --verbose flag)

[2023-12-09 16:58:56,144]    INFO [count_nac] Using index ./mouse_spike.idx to generate BUS file to . from
[2023-12-09 16:58:56,145]    INFO [count_nac]         /projects/zcsu_research1/ysu/mouse/analysis/velocity/kb/tmp/tmpq63z_0e8
[2023-12-09 17:27:17,759]    INFO [count_nac] Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2023-12-09 17:27:43,493]    INFO [count_nac] On-list not provided
[2023-12-09 17:27:43,494]    INFO [count_nac] Generating on-list ./whitelist.txt from BUS file ./tmp/output.s.bus
[2023-12-09 17:27:44,697]    INFO [count_nac] Inspecting BUS file ./tmp/output.s.bus
[2023-12-09 17:27:45,799]    INFO [count_nac] Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with on-list ./whitelist.txt
[2023-12-09 17:27:47,302]    INFO [count_nac] Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2023-12-09 17:27:58,001]    INFO [count_nac] Generating count matrix ./counts_unfiltered/cells_x_tcc from BUS file ./output.unfiltered.bus
[2023-12-09 17:36:21,761]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/users/ysu13/.local/anaconda3/envs/kb/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/users/ysu13/.local/anaconda3/envs/kb/lib/python3.10/site-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/users/ysu13/.local/anaconda3/envs/kb/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/users/ysu13/.local/anaconda3/envs/kb/lib/python3.10/site-packages/kb_python/count.py", line 2078, in count_nac
    genes_paths=[
  File "/users/ysu13/.local/anaconda3/envs/kb/lib/python3.10/site-packages/kb_python/count.py", line 2079, in <listcomp>
    unfiltered_results[prefix][f'txnames{suffix}'] if tcc
KeyError: 'txnames'

As a side note, I was wondering if this is a viable way to get transcript level count estimates for unspliced transcripts, since the un-spliced transcript is essentially the entire gene (including introns) now. Also, looking in count.py, I noticed that there doesn't seem to be a TPM level quantification output in nac workflow. I could going about this in the wrong way and perhaps estimating transcript level counts for un-spliced transcripts is not viable, which would render TPM quantification futile anyways.

Yenaled commented 9 months ago

You can't get TCCs that way. You want to get nascent, ambiguous, mature matrices -- but you can't do so with TCCs (how do you assign an equivalence class as being "mature", "ambiguous", or "nascent"?). As for TPMs, what exactly does "effective length" mean when including unspliced transcripts?

This is why it was intentionally left out (although we'll probably insert an informative error message in a later release).

bio-info-guy commented 9 months ago

Thank you for the response, I understand the problem with finding an effective length for unspliced transcripts. I am still a bit confused as to the inability of assigning an equivalence class as being "mature", "ambiguous", or "nascent", because it seems the program still produces the relevant nascent, ambiguous, mature cells_x_tcc.mtx matrices. The matrices are all of dimensions: cells x number_of_ec, and the proportions of nascent vs (mature + ambiguous) are similar to what is obtained with the --tcc option off.

Yenaled commented 9 months ago

I would not use —tcc with that workflow, period. It has not been benchmarked, validated, or used on any analysis. You can always get numbers out but that doesn’t mean those numbers are actually useful. What does an “ambiguous” TCC actually tell you and what can you actually gain from it? You can run kallisto’s EM algorithm on it and you’ll get numbers, but what exactly do those numbers mean? What are the implications of multimapping (between genes, between transcripts, and between splicing statuses)?

Perhaps these are open research questions, but that’s not kb-python’s purpose to address. You can work around kb-python to tackle these open questions, but it’s not something I have the ability to provide user support for.

bio-info-guy commented 9 months ago

Thank you for the explanations, I will close this issue now

pachterlab / kb_python

Error when Running nac with tcc option #227