centroid sequence length after clustering is different from input sequences' length which are all equal to 200n

Hey all, hope you're doing fine...

I have a problem, I will provide the input/command/output first then ask the question. Thanks a lot in advance for your help!

Input/command/output description

I have the following fasta file called test_input.fasta, which contains three sequences of same length (200):

>A
CGGGAAGCCCAAGGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACC
ACCGGCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCG
CCCACACCGTGTCGTTCGCGTGCGACATGTCGTGGGAGGAACTGCTGTGGCTGGCCGACG
GCCACGAGGTTCACATCTGC
>B
CGGGAAGCCCAAGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCCACCA
CCGGCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGC
CCACACCGTGTCGTTCGCGTTCGACATGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGG
CCACGAGGTGCACATCTGCG
>C
CGGGAAGCCCAAGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCA
CCGGCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGC
CCACACCGTGTCGTTCGCGTTCGACTTGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGG
CCACGAGGTGCACATCTGCG

I run the following command:

 vsearch --cluster_size test_input.fasta --consout test_output.fasta --clusterout_id --id 0.97 --iddef 4 --sizein --sizeout --uc test_uc.uc &> test_log.log

and got the following results:

test_log.log:


vsearch v2.27.0_linux_x86_64, 125.7GB RAM, 32 cores
https://github.com/torognes/vsearch

Reading file test_input.fasta 100% 600 nt in 3 seqs, min 200, max 200, avg 200 Masking 100% Sorting by abundance 100% Counting k-mers 100% Clustering 100% Sorting clusters 100% Writing clusters 100% Clusters: 1 Size min 3, max 3, avg 3.0 Singletons: 0, 0.0% of seqs, 0.0% of clusters Multiple alignments 100%

2. `test_output.fasta`:

centroid=A;seqs=3;clusterid=0;size=3 CGGGAAGCCCAAGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCGGCGCCGGATCTTCGAGC CGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACCGTGTCGTTCGCGTTCGACATGTCGTGGGAGGAA CTGCTGTGGCTGGCCGACGGCCACGAGGTGCACATCTGC
3. `test_uc.uc`:
S 0 200 A H 0 200 97.5 + 0 0 12MI187MD B A H 0 200 97.5 + 0 0 12MI187MD C A C 0 3 A

Problem description

The output sequence is of length 199 instead of 200, even though all the sequences are of the exact same length!

This is one example of the problem, I'm running a pipeline with 1,058 fasta files(containing 5,235,523 sequences), all sequences of length 200. I dereplicate using vsearch --fastx_uniques and poping out the sizes, then I cluster these files using the same command (with the same parameters) as the example. I got around 40 cluster consensus sequences with lengths ranging between 199 and 203. Sorry I don't have the exact number right now but I can provide all the clusters that have problems in them if needed.

Question

Why I'm having this behavior? and more importantly, can I still use these sequences for further analysis?

Notes

I can't provide the full sequences nor the full code just yet, since we haven't published our results yet. But I'm more than happy to share the ones that have problems for now and everything in the future!

Hi, thanks for the very well described question!

I do not think it is wrong to end up with consensus sequences of different length even if all the input sequences are the same length. The result will depend on how they align.

It may be easier to understand the behavior of vsearch in this case if we look at the pairwise alignments and the multiple alignment of the sequences by including --msaout msa.fasta --alnout alnout.txt on the command line, like this:

vsearch --cluster_size test_input.fasta --consout test_output.fasta --clusterout_id --id 0.97 --iddef 4 --sizein --sizeout --uc test_uc.uc --msaout msa.fasta --alnout alnout.txt &> test_log.log

In this case vsearch selected sequence A as the centroid, since it is the first in the file and they all have the same abundance. It will then align the other sequences B and C to the centroid sequence (A) as shown in the pairwise alignment output below (alnout.txt):

vsearch --cluster_size test_input.fasta --consout test_output.fasta --clusterout_id --id 0.97 --iddef 4 --sizein --sizeout --uc test_uc.uc --msaout msa.fasta --alnout alnout.txt
vsearch v2.28.1_macos_aarch64, 32.0GB RAM, 12 cores

Query >B
 %Id   TLen  Target
 98%    200  A

 Query 200nt >B
Target 200nt >A

Qry   1 + CGGGAAGCCCAA-GGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCCACCACCG 63
          |||||||||||| ||||||||||||||||||||||||||||||||||||||||||| |||||||
Tgt   1 + CGGGAAGCCCAAGGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCG 64

Qry  64 + GCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACC 127
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt  65 + GCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACC 128

Qry 128 + GTGTCGTTCGCGTTCGACATGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGGCCACGAGGTGC 191
          ||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||| |
Tgt 129 + GTGTCGTTCGCGTGCGACATGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGGCCACGAGGTTC 192

Qry 192 + ACATCTGC 199
          ||||||||
Tgt 193 + ACATCTGC 200

200 cols, 196 ids (97.5%), 1 gaps (0.5%)

Query >C
 %Id   TLen  Target
 98%    200  A

 Query 200nt >C
Target 200nt >A

Qry   1 + CGGGAAGCCCAA-GGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCG 63
          |||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt   1 + CGGGAAGCCCAAGGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCG 64

Qry  64 + GCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACC 127
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt  65 + GCGCCGGATCTTCGAGCCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACC 128

Qry 128 + GTGTCGTTCGCGTTCGACTTGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGGCCACGAGGTGC 191
          ||||||||||||| |||| ||||||||||||||||||||||||||||||||||||||||||| |
Tgt 129 + GTGTCGTTCGCGTGCGACATGTCGTGGGAGGAACTGCTGTGGCTGGCCGACGGCCACGAGGTTC 192

Qry 192 + ACATCTGC 199
          ||||||||
Tgt 193 + ACATCTGC 200

200 cols, 196 ids (97.5%), 1 gaps (0.5%)

As you see, there is a gap after position 12 in the alignment because sequence A has 6 G's while sequences B and C have only 5 G's in this region. The final C in sequence A is also not aligned with the other sequences. In addition there are a couple of mismatches, but they do not matter here.

The multiple sequence alignment (MSA) looks like this (msa.fasta):

>*A;size=1
CGGGAAGCCCAAGGGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCGGCGCCGGATCTTCGAG
CCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACCGTGTCGTTCGCGTGCGACATGTCGTGGGAGGA
ACTGCTGTGGCTGGCCGACGGCCACGAGGTTCACATCTGC-
>B;size=1
CGGGAAGCCCAA-GGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCCACCACCGGCGCCGGATCTTCGAG
CCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACCGTGTCGTTCGCGTTCGACATGTCGTGGGAGGA
ACTGCTGTGGCTGGCCGACGGCCACGAGGTGCACATCTGCG
>C;size=1
CGGGAAGCCCAA-GGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCGGCGCCGGATCTTCGAG
CCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACCGTGTCGTTCGCGTTCGACTTGTCGTGGGAGGA
ACTGCTGTGGCTGGCCGACGGCCACGAGGTGCACATCTGCG
>consensus
CGGGAAGCCCAA-GGGGGTGGTGACCGAGTACGCGGGCCTCACCAACATGCTGGTCAACCACCGGCGCCGGATCTTCGAG
CCGGTCCTCGCGCGCCACGGCCACCGGGTCTTCCGGGTCGCCCACACCGTGTCGTTCGCGTTCGACATGTCGTGGGAGGA
ACTGCTGTGGCTGGCCGACGGCCACGAGGTGCACATCTGC+

The star indicates that sequence A is the centroid. The fourth sequence is the consensus sequence, with gaps retained. The sequence that is written to the test_output.fasta file (using option --consout) is identical but lacks the gaps.

The final gap in the consensus is indicated with a plus sign ('+'). This is because this position is not really aligned.

The MSA is created with a very simple algorithm (the center star method) that just bases the MSA on the pairwise alignments between the centroid sequence (A) and each of the other sequences (B and C), i.e. the A-B and the A-C alignments. Gaps at the ends were not included in those pairwise alignments.

SInce the extra G in the beginning is found in a minority of the sequences (just A) it is represented by a gap in the consensus sequence and not included in the test_output.fasta file.

The final G is not included in the consensus because this nucleotide was not aligned since it was not found in the centroid sequence.

Altogether this makes the consensus sequence 1 nucleotide shorter than the other sequences.

In general, I think you could still use the sequences for further analysis, but it depends of course on what the next steps are. Maybe it would be better to just use the centroid sequence instead of the consensus sequence?

You could alternatively extract the sequences in each cluster, perform a more advanced multiple alignment on them (with a different tool than vsearch) and obtain the consensus after that. That might give you a better consensus sequence.

I think the major problem with the current approach is the alignment of the ends of the sequences. If they all started and ended with a fixed sequence, it would be less of a problem.

You could end up with a consensus that is longer or shorter than the other sequences also if the ends were fixed, like this unrealistic example:

Seq A: CGGGAA-GC-CAA
Seq B: CG-GAATGC-CAA
Seq C: CG-GAA-GCCCAA
Cons:  CG-GAA-GC-CAA

I hope this clarifies the issue for you.

torognes / vsearch