mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0

clarification on Isoform quantification parameters and output #44

Open sparthib opened 5 days ago

sparthib commented 5 days ago

Hi there,

I am running the FLAMES single-cell pipeline and was wondering if I could get clarification on how the min_sup_cnt parameter affects samples of different sequencing depths. For example, when I ran it under the default setting, i.e. min_sup_cnt = 5, smaller replicates had fewer unique isoforms in the transcript count matrix than bigger replicates. This makes sense. I then changed the setting to 2 to see whether the number of isoforms in the final output (transcript_count.csv.gz) would increase, and it did. However, the number of isoforms in transcript_count.bad_coverage.csv.gz also increased, whereas I had expected the opposite: with a more lenient read-count threshold, I would expect the bad-coverage count matrix to contain fewer isoforms. I guess my question is: what exactly does transcript_count.bad_coverage.csv.gz keep track of? Also, if you have suggestions for a setting more flexible than min_sup_cnt to account for multiple sequencing depths, that would be great.

Thanks,

Sowmya

ChangqingW commented 3 days ago

It is documented in create_config:

> ?FLAMES::create_config
...
          min_sup_cnt - Minimum number of read support an isoform
              decrease this number will significantly increase the
              number of isoform detected.

          min_cnt_pct - Minimum percentage of count for an isoform
              relative to total count for the same gene.

          min_sup_pct - Minimum percentage of count for an splice chain
              that support a given transcript start/end site
              combination.

I believe transcript_count.bad_coverage.csv.gz keeps track of alignments with coverage less than min_tr_coverage, i.e. your read aligned to transcript A but covers less than min_tr_coverage of it. I am adding oarfish (https://github.com/COMBINE-lab/oarfish) as an optional quantification method, which will hopefully give better counts, as it will attempt to allocate the ambiguous alignments.

I am not sure I understand what you mean by multiple sequencing depths. Do you have multiple samples with different sequencing depths?
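
To make the coverage explanation above concrete, here is a minimal Python sketch of how alignments could be split into "good" and "bad coverage" counts. It is only an illustration of the idea, not FLAMES's actual implementation: the function names, the (transcript, start, end) alignment tuples and the 0.75 threshold are all assumptions made for the example.

```python
def coverage_fraction(aln_start, aln_end, transcript_length):
    """Fraction of a transcript spanned by one alignment (0-based, end-exclusive)."""
    return (aln_end - aln_start) / transcript_length


def split_by_coverage(alignments, transcript_lengths, min_tr_coverage=0.75):
    """Count alignments per transcript, separated by whether they reach the threshold.

    alignments: iterable of (transcript_id, start, end) tuples (hypothetical layout).
    min_tr_coverage: illustrative value only, not taken from the FLAMES defaults.
    """
    good, bad = {}, {}
    for tr_id, start, end in alignments:
        frac = coverage_fraction(start, end, transcript_lengths[tr_id])
        target = good if frac >= min_tr_coverage else bad
        target[tr_id] = target.get(tr_id, 0) + 1
    return good, bad


lengths = {"A": 1000, "B": 1000}
alns = [("A", 0, 900), ("A", 0, 300), ("B", 100, 200)]
good_counts, bad_counts = split_by_coverage(alns, lengths)
```

If this reading is right, a lower min_sup_cnt reports more isoforms, which gives partial alignments more transcripts to land on. That would be consistent with both count matrices growing at the same time.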

sparthib commented 3 days ago

Thank you! Yes, I have samples with different sequencing depths. Is it possible to change the cutoff from an absolute count to perhaps TPM? Although I am not sure how much difference this would make for isoforms with an extremely small number of reads aligned to them.

Thanks! Sowmya

ChangqingW commented 1 day ago

Not at the moment, but it does sound reasonable; maybe we could update it.
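
In the meantime, a depth-relative cutoff can be approximated after the fact on the count output. The sketch below is Python and uses counts per million (CPM) as a simple stand-in for the TPM idea; it is a post-hoc filter, not a FLAMES option, and the function names and toy counts are assumptions made for the example.

```python
def cpm(counts):
    """Scale per-transcript read counts to counts per million for one sample."""
    total = sum(counts.values())
    if total == 0:
        return {tr: 0.0 for tr in counts}
    return {tr: c * 1e6 / total for tr, c in counts.items()}


def filter_by_cpm(counts, min_cpm):
    """Keep transcripts whose CPM reaches min_cpm; raw counts are returned unchanged."""
    scaled = cpm(counts)
    return {tr: c for tr, c in counts.items() if scaled[tr] >= min_cpm}


# Two toy samples with the same composition but a 10x difference in depth.
small = {"tx1": 2, "tx2": 98}
large = {"tx1": 20, "tx2": 980}

# An absolute cutoff of 5 reads treats the two samples differently...
kept_abs_small = {tr for tr, c in small.items() if c >= 5}
kept_abs_large = {tr for tr, c in large.items() if c >= 5}

# ...while a relative cutoff (here 1% of reads, i.e. 10,000 CPM) does not.
kept_cpm_small = set(filter_by_cpm(small, 10_000))
kept_cpm_large = set(filter_by_cpm(large, 10_000))
```

As noted earlier in the thread, a relative cutoff does not make isoforms supported by only one or two reads any more reliable, so it may still be worth pairing it with a small absolute minimum.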