How should --min_candidate_support be set appropriately?

ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.

GNU General Public License v3.0


How should --min_candidate_support be set appropriately? #5

Open wheatwill opened 5 years ago

wheatwill commented 5 years ago

Hi, when I run IsoCon, I find that the results vary greatly with different --min_candidate_support settings. How should I choose a value for this parameter?

ksahlin commented 5 years ago

It depends primarily on the characteristics of your data, but also on your goals. In general, the lower the cutoff, the more sensitive the algorithm will be (that is, it will detect more lowly expressed sequences but also predict more erroneous sequences).

What type of data is it? Are these PacBio CCS reads? Was the data targeted with primers designed to capture a specific 5' and 3' location (meaning that the ends will be well defined)?
How deep is the sequencing? Specifically: how many reads do you have, and how many transcript variants do you expect (ballpark: 10? 100? 1000?)? The number of reads divided by the expected number of transcripts can give you an estimate of how to set the cutoff.
The gene family and species can also be useful to know.
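
The reads-per-transcript estimate mentioned above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `reads_per_transcript` and the read count of 5000 are made up for the example and are not part of IsoCon. The result is an average, so lowly expressed transcripts will typically have fewer supporting reads than this number suggests, and the cutoff still has to be chosen with that in mind.

```python
def reads_per_transcript(num_reads, expected_transcripts):
    """Average number of reads per expected transcript (hypothetical helper).

    This is only the ratio described in the comment above; it does not
    pick a value for --min_candidate_support by itself.
    """
    if num_reads < 0 or expected_transcripts <= 0:
        raise ValueError("need num_reads >= 0 and expected_transcripts > 0")
    return num_reads / expected_transcripts


# Illustrative numbers only: 5000 reads and three ballpark transcript counts.
for expected in (10, 100, 1000):
    depth = reads_per_transcript(5000, expected)
    print(f"{expected} expected transcripts: about {depth:g} reads each on average")
```

A cutoff well above the average would likely discard most true variants, while a cutoff far below it is more permissive toward erroneous sequences.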

wheatwill commented 5 years ago

Thank you very much for your quick reply! Actually, I am running a set of non-targeted Iso-Seq data. The gene family I am interested in is expected to contain 10-20 members (tandem repeat genes), but only 3 of them have been assembled successfully in the reference genome. So I am trying to get the other transcripts from a full-length transcriptome generated on a PacBio RS II. I used blastn to extract 1500 sequences from all the flnc reads. Then I ran the IsoCon pipeline directly:

IsoCon pipeline -fl_reads blast.out.flnc.fasta -outfolder test.IsoCon.out --ccs polished.total.flnc.bam --nr_cores 24 --min_candidate_support 10

--min_candidate_support 10 gives 4 final candidates
--min_candidate_support 5 gives 15 final candidates

Should I trim these BLAST-extracted flnc reads to the same start and end positions?

ksahlin commented 5 years ago

Trimming the starts and ends at the same locations will greatly help IsoCon find the variants and work as it was designed to. This is very much the preferred option! Let's see if you get the same variability after this.

You can do some post-analysis of IsoCon's results by looking at the read support of each final candidate (this could be done as a sanity check on the results both with and without trimmed ends). The support can be observed by counting the number of reads that were assigned to each consensus in the cluster_info.tsv file. (Alternatively, the accessions of the candidates in final_candidates.fa contain related information on how many reads support them, but counting rows in the tsv is more exact.)
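
Counting rows per consensus could look roughly like the Python sketch below. It assumes that each row of cluster_info.tsv describes one read and that the consensus it was assigned to is in the second tab-separated column. That column position is an assumption, not something confirmed in this thread, so check a few rows of your own file and adjust `consensus_column` if needed. The function name `count_read_support` is made up for this example.

```python
import csv
from collections import Counter


def count_read_support(tsv_path, consensus_column=1):
    """Count how many rows (reads) are assigned to each consensus.

    Assumption: one read per row, with the consensus identifier in the
    zero-based column `consensus_column`. Verify this against your file.
    """
    support = Counter()
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) <= consensus_column:
                continue  # skip blank or truncated rows
            support[row[consensus_column]] += 1
    return support
```

Calling `count_read_support("cluster_info.tsv").most_common()` would then list the candidates from most to least supported, which makes it easy to see which ones sit close to the --min_candidate_support cutoff.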