torognes / vsearch

Versatile open-source tool for microbiome analysis

interpretation of --maxaccepts vsearch #569

Open Robvh-git opened 3 months ago

Robvh-git commented 3 months ago

Hello,

I've got a question regarding the argument --maxaccepts of the vsearch command --cluster_fast:

The manpage states the following about maxaccepts:

"The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If --maxaccepts is set to a higher value, more hits are accepted"

What exactly is meant by "If --maxaccepts is set to a higher value, more hits are accepted"?

What will happen when another hit is accepted?

I guess the target sequences are the centroids or seed sequences of the clusters in this case?

So these centroids (i.e. target sequences) are sorted by the number of k-mers they have in common with the query, which will likely approximate pairwise sequence similarity.

I can understand that if --maxaccepts 1 (default) is specified, vsearch goes through these pairwise alignments and selects the first one that matches the criteria (e.g. 97% similarity). Then the query sequence is placed in that cluster(?)

But if e.g. --maxaccepts 2 is specified, the query sequence can be accepted in two clusters? Or how does this work?

I can imagine that the first alignment that matches the criterion is not necessarily the best one, so you would preferably check multiple accepted target sequences and select the best one among them (i.e. place your query sequence in the cluster that matches best). Is that what --maxaccepts is about? In that case, I would expect a description like: "If --maxaccepts is set to a higher value, more hits are accepted and the best matching target sequence is finally selected as hit" or something like that.

torognes commented 3 months ago

Hi

Thank you for your questions. I'll try to clarify.

During clustering and many other tasks, vsearch will perform heuristic searches to find similar sequences. This is done, as you describe, by first considering the number of shared k-mers (8-mers by default) between the query and each target sequence. The target sequences are then sorted by decreasing number of shared k-mers. The sequence with the highest number of shared k-mers is considered first. If this sequence has the required amount of similarity with the query sequence in terms of percentage identity (e.g. 97%) or other requirements (depending on options used), it is "accepted". If it does not satisfy the requirements, it is "rejected".

If the --maxaccepts option is used and set to higher than 1 (default), the next target sequence, with the next highest number of shared k-mers, will also be considered. If this sequence also meets the requirements (e.g. 97% identity), it will also be accepted. In this way more than one sequence may be accepted. When the maximum number of accepted sequences (option --maxaccepts, default 1) or rejected sequences (option --maxrejects, default 32) is reached, vsearch will stop considering more target sequences for this query.
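For readers who prefer code, the search loop described above can be sketched roughly as follows. This is an illustration only, not vsearch's actual implementation: `heuristic_search` and `toy_identity` are hypothetical names, the k-mer count is a simple set intersection, and the position-by-position identity is a stand-in for a real pairwise alignment.

```python
from typing import Callable, List, Tuple


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for alignment: fraction of matching positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def heuristic_search(
    query: str,
    targets: List[str],
    identity: Callable[[str, str], float],
    min_id: float = 0.97,
    maxaccepts: int = 1,
    maxrejects: int = 32,
    k: int = 8,
) -> List[Tuple[str, float]]:
    """Return accepted (target, identity) pairs in the order they were accepted."""

    def kmers(seq: str) -> set:
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    query_kmers = kmers(query)
    # Rank targets by decreasing number of k-mers shared with the query.
    ranked = sorted(targets, key=lambda t: len(query_kmers & kmers(t)), reverse=True)

    accepted: List[Tuple[str, float]] = []
    rejected = 0
    for target in ranked:
        pid = identity(query, target)
        if pid >= min_id:
            accepted.append((target, pid))
        else:
            rejected += 1
        # Stop as soon as either limit is reached.
        if len(accepted) >= maxaccepts or rejected >= maxrejects:
            break
    return accepted


targets = ["ACGTACGTAA", "ACGTACGTAC", "TTTTTTTTTT"]
first_only = heuristic_search("ACGTACGTAC", targets, toy_identity,
                              min_id=0.85, maxaccepts=1, k=3)
two_hits = heuristic_search("ACGTACGTAC", targets, toy_identity,
                            min_id=0.85, maxaccepts=2, k=3)
```

In this toy input the first two targets share the same number of 3-mers with the query, so with `maxaccepts=1` the search stops at a target that passes the threshold but is not the closest one, while `maxaccepts=2` also accepts the identical sequence.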

What happens if more than one target sequence is accepted? When clustering, the default is to sort the accepted sequences by sequence similarity and choose the target sequence, i.e. centroid, that has the highest similarity. The query sequence is then placed in that cluster. Alternatively, if the --sizeorder option is specified, the accepted centroids will be sorted by abundance, and the centroid with the highest abundance will be chosen.
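The choice among several accepted centroids could be sketched like this. It is a minimal illustration under assumptions: the `Hit` record and `choose_centroid` function are hypothetical names, and tie-breaking (here, the first hit in the list wins) is not specified in this thread.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    centroid: str     # identifier of the accepted centroid (hypothetical field)
    identity: float   # similarity to the query, e.g. 0.98
    abundance: int    # abundance of the centroid


def choose_centroid(accepted: List[Hit], sizeorder: bool = False) -> Optional[Hit]:
    """Pick the cluster for a query from its accepted centroids."""
    if not accepted:
        return None  # no hit: the query would presumably seed a new cluster
    if sizeorder:
        # With --sizeorder: prefer the most abundant accepted centroid.
        return max(accepted, key=lambda h: h.abundance)
    # Default: prefer the accepted centroid with the highest similarity.
    return max(accepted, key=lambda h: h.identity)


hits = [Hit("c1", 0.975, 10), Hit("c2", 0.99, 3)]
by_identity = choose_centroid(hits)
by_abundance = choose_centroid(hits, sizeorder=True)
```

Here the default picks `c2` (higher identity), whereas the `sizeorder=True` path picks `c1` (higher abundance).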

When searching, not clustering, one or more of the target sequences may be reported as hits for the query, depending on the --maxhits and --top_hits_only options.
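A rough sketch of how the reporting step might filter accepted hits in search mode is shown below. It assumes that --top_hits_only keeps only the hits tied for the highest identity and that --maxhits caps the number of hits reported per query; the function name `report_hits` and the order in which the two filters are applied are assumptions, not taken from the vsearch source.

```python
from typing import List, Optional, Tuple


def report_hits(
    accepted: List[Tuple[str, float]],
    maxhits: Optional[int] = None,
    top_hits_only: bool = False,
) -> List[Tuple[str, float]]:
    """Filter accepted (target, identity) pairs for reporting."""
    # Sort by decreasing identity; ties keep their original order.
    hits = sorted(accepted, key=lambda h: h[1], reverse=True)
    if top_hits_only and hits:
        best = hits[0][1]
        hits = [h for h in hits if h[1] == best]
    if maxhits is not None:
        hits = hits[:maxhits]
    return hits


accepted = [("t1", 0.97), ("t2", 0.99), ("t3", 0.99)]
all_hits = report_hits(accepted)
top_only = report_hits(accepted, top_hits_only=True)
single = report_hits(accepted, maxhits=1)
```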

I agree that the documentation could be clearer regarding this issue. We will try to improve it for the next release.

Robvh-git commented 3 months ago

Hi @torognes, thank you for the elaborate answer; it is completely clear now. I think it would indeed be helpful to add this info to the documentation.

torognes commented 3 months ago

Reopening the issue to remember to update the documentation.