ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing
GNU General Public License v3.0

Flexible clustering? #23

Open omarkr8 opened 1 year ago

omarkr8 commented 1 year ago

Is there a way to adjust clustering parameters?

For example, some OTU pipelines generate a different number of bins depending on whether you want 98% or 95% similarity clusters. I do not see an option for this in NGSpeciesID. On that note, what is the percentage threshold used here?

ksahlin commented 1 year ago

Yes, there are several parameters to adjust (NGSpeciesID uses isONclust for the clustering step). However, NGSpeciesID is not built to separate or cluster sequences at an exact, predetermined similarity threshold.

You can mimic very stringent clustering by setting a large --k, a lower --w, and high --mapped_threshold and --aligned_threshold. For example, if your sequences contain few errors, you could set --mapped_threshold and --aligned_threshold to 0.9, --k between 20 and 30, and --w between k+10 and k+30.

The above suggestion will not work when your sequences have error rates higher than about 2%, as is the case for ONT long reads.
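
To make the suggested ranges concrete, here is a minimal Python sketch that assembles a stringent command line and rejects values outside the ranges mentioned above. The helper name `stringent_command` and its defaults (k=25, w=k+20) are hypothetical choices for illustration. The `--fastq` and `--outfolder` flags are not mentioned in this thread and are assumed from the tool's usual invocation, so please check `NGSpeciesID --help` before relying on them.

```python
import shlex


def stringent_command(fastq, outfolder, k=25, w_offset=20, threshold=0.9):
    """Build an NGSpeciesID command line with stringent clustering settings.

    Hypothetical helper: the ranges follow the suggestion in this thread
    (k between 20 and 30, w between k+10 and k+30, both thresholds high).
    """
    if not 20 <= k <= 30:
        raise ValueError("k should be between 20 and 30")
    if not 10 <= w_offset <= 30:
        raise ValueError("w should be between k+10 and k+30")
    w = k + w_offset
    args = [
        "NGSpeciesID",
        "--fastq", fastq,          # assumed input flag, not from this thread
        "--outfolder", outfolder,  # assumed output flag, not from this thread
        "--k", str(k),
        "--w", str(w),
        "--mapped_threshold", str(threshold),
        "--aligned_threshold", str(threshold),
    ]
    # shlex.join quotes arguments safely for pasting into a shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(stringent_command("reads.fastq", "out_stringent"))
```

As noted above, settings like these are only expected to behave well on low-error reads.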

FYI, all of these parameters relate to the clustering:

  --k K                 Kmer size (default: 15)
  --w W                 Window size (default: 50)
  --min_shared MIN_SHARED
                        Minimum number of minimizers shared between read and cluster (default: 5)
  --mapped_threshold MAPPED_THRESHOLD
                        Minimum mapped fraction of read to be included in cluster. The density of minimizers to classify a region as mapped depends on quality of the read. (default: 0.7)
  --aligned_threshold ALIGNED_THRESHOLD
                        Minimum aligned fraction of read to be included in cluster. Aligned identity depends on the quality of the read. (default: 0.4)
  --min_fraction MIN_FRACTION
                        Minimum fraction of minimizers shared compared to best hit, in order to continue mapping. (default: 0.8)
  --min_prob_no_hits MIN_PROB_NO_HITS
                        Minimum probability for i consecutive minimizers to be different between read and representative and still considered as mapped region, under assumption that they come from the same transcript (depends on read quality). (default: 0.1)
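
For readers unfamiliar with how --k and --w interact, the toy Python sketch below picks minimizers from a sequence. It is not isONclust's implementation. It assumes a window spans w bases (so it holds w - k + 1 k-mers), orders k-mers lexicographically, and breaks ties by taking the leftmost k-mer. The function name `minimizers` is hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs chosen as window minimizers.

    Toy illustration: each window covers w bases, i.e. w - k + 1
    consecutive k-mers, and the lexicographically smallest k-mer
    (leftmost on ties) is kept for that window.
    """
    if not 0 < k <= w:
        raise ValueError("need 0 < k <= w")
    kmers_per_window = w - k + 1
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - kmers_per_window + 1):
        window = kmers[start:start + kmers_per_window]
        # min() returns the first smallest index, so ties go to the left.
        offset = min(range(kmers_per_window), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


if __name__ == "__main__":
    read = "GATTACA"
    print(minimizers(read, k=3, w=5))  # sparse sampling of k-mers
    print(minimizers(read, k=3, w=3))  # w == k keeps every k-mer
```

In this sketch, shrinking w relative to k samples more k-mers per read, and a larger k makes each sampled k-mer less likely to survive a sequencing error. That is one way to see why the stringent settings are only suggested for low-error reads.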