torognes / vsearch

Versatile open-source tool for microbiome analysis

Found 2 identical consensus sequences with --id 0.9 #307

Status: closed by MaestSi 6 years ago

MaestSi commented 6 years ago

Dear vsearch developers, I am trying to determine whether my Illumina-sequenced sample contains one or more organisms, so I clustered the reads with --id 0.9 and then BLASTed only the consensus sequence of each cluster. I later noticed that at least 2 clusters have exactly identical consensus sequences. Why haven't they been merged? Should I manually run vsearch a second time, passing the FASTA file of cluster consensus sequences as input? I tried both the --cluster_fast and --cluster_size options, but the result did not change. The only explanation I have is that, at the beginning of the clustering process, I had 2 reads differing by more than 10%, which resulted in 2 different clusters. What do you think? Thank you in advance.

colinbrislawn commented 6 years ago

Hello @MaestSi

Strange! Can you post the full command(s) you ran that produced the duplicate consensus sequence? Can you also post the consensus sequence from the vsearch output (not the blast hit)?

"The only explanation I have is that, at the beginning of the clustering process, I had 2 reads differing by more than 10%, which resulted in 2 different clusters."

This could be the case. Looking at the vsearch output (not the blast output!) will help us solve this issue.

Thanks!

Colin

MaestSi commented 6 years ago

Sure, the command I ran is:

vsearch --cluster_size sample_name.fasta --clusterout_sort --strand both --id 0.9 --fasta_width 0 --sizeout --consout cons.fasta

The identical consensus sequences are:

>centroid=NB500897:212:HYJNYAFXX:1:11201:1452:5991_2:N:0:TAGGCATG;seqs=53;size=53;
TAAACTTCAGGGTGACCAAAAAATCAAAATAAGTGTTGGTATAAAATGGGGTCTCCTCCTCCTGTAGGGTCAAAGAAGCTAGTATTTAAATTTCGATCGGTTAATAGTATAGTAATTGCCCCTGCTAGAACAGGTAATGAAAGTAAAAGT

>centroid=NB500897:212:HYJNYAFXX:1:21101:3437:16599_2:N:0:TAGGCATG;seqs=44;size=44;
TAAACTTCAGGGTGACCAAAAAATCAAAATAAGTGTTGGTATAAAATGGGGTCTCCTCCTCCTGTAGGGTCAAAGAAGCTAGTATTTAAATTTCGATCGGTTAATAGTATAGTAATTGCCCCTGCTAGAACAGGTAATGAAAGTAAAAGT

Thanks.
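A quick way to confirm that two or more consensus records really are identical is to group the FASTA headers by sequence. The following is a minimal sketch, not part of vsearch; the helper names `read_fasta` and `find_duplicate_sequences` and the short toy records are hypothetical, and sequences on the opposite strand would not be detected.

```python
from collections import defaultdict


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def find_duplicate_sequences(text):
    """Return {sequence: [headers]} for sequences that occur more than once."""
    groups = defaultdict(list)
    for header, seq in read_fasta(text):
        groups[seq].append(header)
    return {seq: headers for seq, headers in groups.items() if len(headers) > 1}


# Hypothetical toy records standing in for a consensus FASTA file.
toy = """>centroid=readA;seqs=53;size=53;
ACGTACGT
>centroid=readB;seqs=44;size=44;
ACGTACGT
>centroid=readC;seqs=7;size=7;
ACGTTCGT
"""

dups = find_duplicate_sequences(toy)
for seq, headers in dups.items():
    print(len(headers), "records share the sequence", seq)
```

For a real file, the text could be loaded with `open("cons.fasta").read()` before calling `find_duplicate_sequences`.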

colinbrislawn commented 6 years ago

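What happens when you run this using --centroids instead of --consout? The --consout option writes a consensus sequence computed from the reads in each cluster, so it can differ from every input read, whereas --centroids writes each cluster's centroid read unchanged.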

You can also sort them afterwards, as I showed in the other post!
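To illustrate how two clusters with different centroids can end up with the same consensus, here is a minimal sketch using a column-wise majority vote over pre-aligned, equal-length toy reads. This is a simplification and an assumption for illustration only: it is not vsearch's consensus algorithm, and the function name `majority_consensus` and all sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the most common base in each column of equal-length reads."""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    consensus = []
    for col in range(length):
        votes = Counter(read[col] for read in aligned_reads)
        consensus.append(votes.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical 20-base toy data: a common sequence and two noisy centroids.
common = "ACGTACGTACGTACGTACGT"
centroid_1 = "ATATACGTACGTACGTACGT"  # 2 mismatches to `common` (90% identical)
centroid_2 = "ACGTACGTACGTACGTAACT"  # 2 different mismatches (90% identical)

# Each toy cluster holds its centroid plus three copies of the common sequence.
cluster_1 = [centroid_1, common, common, common]
cluster_2 = [centroid_2, common, common, common]

consensus_1 = majority_consensus(cluster_1)
consensus_2 = majority_consensus(cluster_2)

print("centroids identical:", centroid_1 == centroid_2)
print("consensus identical:", consensus_1 == consensus_2)
```

In this toy case the two centroids differ at 4 of 20 positions (80% identity), so they would stay separate at a 0.9 threshold, yet the errors in each centroid are outvoted by the other cluster members and both consensus sequences collapse to the same string.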

MaestSi commented 6 years ago

If I use --centroids, the sequences I obtain are 87% similar, so they were correctly placed in 2 different clusters. Here are the 2 centroids:

>NB500897:212:HYJNYAFXX:1:11201:1452:5991_2:N:0:TAGGCATG;size=53;
TAAACTTCAGGGTGACCAAAAAATCAAAATAAGTGTTGGTATATGATGGGGTCTCCTCCTCCTGTAGGGTCAAAGAAGCTAGTATTTTAATTTCGATCGGTTATTTGTATAGTAATTGCCCCTGCGAGATCAGGTAATGGAAGTAAAAGT

>NB500897:212:HYJNYAFXX:1:21101:3437:16599_2:N:0:TAGGCATG;size=44;
TAAACTTCAGGGTGACAAAAAAATCAAAATAAGTGTTGGTATAAAATGGGGTCTCATCATCCTGTATTTTAAAATATTCTAGTATTTAAATTTCGATCGGTTAATAGTATAGTAATTGCACCTGCTAGAACAGGTAATGAAAGTAAAAGT

Thanks.
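The similarity between the two centroids can be checked with a simple position-by-position comparison, since both reads have the same length. This is a rough sketch: the function name `ungapped_identity` is hypothetical, and vsearch itself derives identity from a pairwise alignment (see its --iddef option), so its figure may differ slightly from an ungapped count.

```python
def ungapped_identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# The two centroid reads posted above.
centroid_1 = "TAAACTTCAGGGTGACCAAAAAATCAAAATAAGTGTTGGTATATGATGGGGTCTCCTCCTCCTGTAGGGTCAAAGAAGCTAGTATTTTAATTTCGATCGGTTATTTGTATAGTAATTGCCCCTGCGAGATCAGGTAATGGAAGTAAAAGT"
centroid_2 = "TAAACTTCAGGGTGACAAAAAAATCAAAATAAGTGTTGGTATAAAATGGGGTCTCATCATCCTGTATTTTAAAATATTCTAGTATTTAAATTTCGATCGGTTAATAGTATAGTAATTGCACCTGCTAGAACAGGTAATGAAAGTAAAAGT"

identity = ungapped_identity(centroid_1, centroid_2)
print(f"ungapped identity: {identity:.1%}")
print("below the 0.9 threshold:", identity < 0.9)
```

An identity below the --id 0.9 threshold is consistent with the two reads seeding separate clusters, even though the consensus sequences of those clusters turned out identical.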

colinbrislawn commented 6 years ago

Ok great! I think you can safely close this issue!