ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing
GNU General Public License v3.0
50 stars 16 forks source link

How to find allelels #19

Closed Thomieh73 closed 2 years ago

Thomieh73 commented 2 years ago

Hi, I am trying to process amplicon sequences of a gene (daa) that in some animals has multiple allelels. Now I am wondering how to adjust the clustering parameters so that I can split the main cluster into two clusters.

This is my command so far:

NGSpeciesID --ont --consensus\
 --sample_size 500 \
  --m 1000 --s 50
 --medaka \
--aligned_threshold 0.2\
 --fastq ${MYSAMPLE}.fastq\
  --outfolder ${MYSAMPLE}.${NUMBER}r.${SIZE}.out \
--primer_file ${PRIMERS}

The first thing that crossed my mind was to make the aligned_threshold stricter? But do you have experience with the other settings? Which one would be good to use as well?

ksahlin commented 2 years ago

Hi @Thomieh73,

Multiple-allele separation is not yet implemented in NGSpeciesID. However, this is on the radar as soon as we have the right person to implement this.

It depends a bit on how similar the alleles are. If we are talking 99% identical, it may be difficult/impossible with NGSpeciesID.

In case ~97% or lower identity, you could try increasing the values of --aligned_threshold, --mapped_threshold and --rc_identity_threshold. A guesstimate is to set them --aligned_threshold 0.9 --mapped_threshold 0.9 --rc_identity_threshold X, where X is a fraction a bit above the expected allele difference (so that in the case of multiple clusters, NGSpeciesID doesn't merge them thinking they are reverse complements).

ksahlin commented 2 years ago

Let me know how it goes.

Thomieh73 commented 2 years ago

Thanks for the feedback @ksahlin, I will try that. I had already figured out that the --rc_identity_threshold was key here, since I noticed that my alleles could be grouped together when they are too close :-)

Thomieh73 commented 2 years ago

Okay, with this command,

NGSpeciesID --ont --consensus --sample_size 500 --m 1000 --s 50 \
     --medaka \
    --rc_identity_threshold 0.96\
    --aligned_threshold 0.8\
     --mapped_threshold 0.9\
     --fastq ${MYSAMPLE}.fastq  --outfolder ${MYSAMPLE}.${NUMBER}r.${SIZE}.out --primer_file ${PRIMERS}

I get two clusters popping out of my data. Funny things is, in the few replicate runs I tried, is that one of the clusters seems to be in the other orientation. So I get different clusters, and in an alignment they are different, showing SNPs. So it sort of works, but I am getting the feeling that it would create much more manual QC to check if the consensus sequences that I get are actually making sense. In addition, due to the random subsampling to select the reads before generating clusters, I sometime get an extra cluster and sometimes it doesn't. I believe that has to do with the original PCR product composition, which in our case, was not always giving optimal products. It is a possible that when there is two allelels, that one of them amplifies better, so it is then harder to pull out of the sequence data.

ksahlin commented 2 years ago

Yes, it seems a bit shaky. A multiple-allele separation feature is definitely desirable in these cases. Hope to add such a feature soon.

Thomieh73 commented 2 years ago

That would be great. But for now, I will close this issue. Thanks for making a nice tool, I do trust it for the purpose you specified, and I recommended it to my fellow colleagues working with meta-barcoding data.