gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0
90 stars 21 forks source link

Canonical (consensus) error #42

Closed vineeth-s closed 2 years ago

vineeth-s commented 2 years ago

Let us consider the following situation, where

u1,u2,u3,u4,u5 are UMI sequences, u1,u2,u3 group together as the same UMI, and u4,u5 form the other UMI

the sequence following the : is the "amplicon"

u1 : AAAGTCAA
u2 : AAAGTCAA
u3 : AAAGTCAA
u4 : AAATTCAA
u5 : AAATTCAA

Now starcode groups the UMIs, and then clusters the amplicons to get the canonical AAAGTCAA The way starcode-umi operates, u1,u2,u3 are grouped together and gets the right canonical sequence, while u4,u5 are grouped together correctly but now gets assigned the wrong canonical sequence AAAGTCAA when it should be AAATTCAA

gui11aume commented 2 years ago

Thank you for your comment. Could you please craft a real example that we can run so that we can 1) make sure that we understand the problem and 2) start to discuss a solution?

vineeth-s commented 2 years ago

Thanks for the reply, here is a short set of fastq reads

@M01579:268:000000000-K3Y7K:1:1101:9391:1930 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1931 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1932 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1933 1:N:0:122
CGTTAGAGTTTACGTTCACCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCACTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1934 1:N:0:122
CGTTAGAGTCTTACGTTCACACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCACTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C

In the first 3 reads, is the stretch

CGCTACCTCAGTT

In the last 2 reads, this stretch is

CACTACCTCAGTT

The UMI is the first 20 bp of each read, starcode-umi correctly pulls them apart as the first 3 reads have the same UMI, and the last 2 the same, but the stretch above is called as

CGCTACCTCAGTT

in both consensus sequences

gui11aume commented 2 years ago

Thanks! Can you include the Starcode command you use to cluster the reads?

vineeth-s commented 2 years ago

Here we go ...

starcode-umi --umi-len 20 --umi-d 4 --seq-d 8 --umi-cluster s --seq-cluster s --seq-id --umi-threads 20 --seq-threads 20 --starcode-path starcode --seq-trim 0 test.fastq > clones.txt
gui11aume commented 2 years ago

Thanks for providing the details. It seems that you have uncovered a deep bug in the spheres clustering algorithm. I should be able to fix it very soon.

gui11aume commented 2 years ago

Actually, this is not a bug; this is the expected behavior of the code. starcode-umi clusters the UMIs and the sequences independently, and then matches them. In your example, the sequences after the UMIs are within Levenshtein distance 1 so they are collapsed into a single canonical sequence in the first step of the algorithm.

Most of the time, this is what the user wants. If the sequences are very close (especially considering that they are 371 nucleotides long) it is likely that they actually correspond to the same biological sequence. However, this is not always the case and that depends on the problem and on the experiment.

Why do you think that in your case the output is invalid?

vineeth-s commented 2 years ago

Hi Guillame, we are interested in SNPs — and this information gets lost

gui11aume commented 2 years ago

I don't think that Starcode is the right tool to study SNPs. However you put it, the minor allele will be clustered with the major one... that's what the algorithm is for.

If you want to cluster the UMIs but not the reads, then the best is to run Starcode on the UMIs with the option --tidy. This will give you a "new" UMI for every read that you can use for grouping. Maybe it would work even better with --seq-id that would give you the index of all the reads that you should group together.

vineeth-s commented 2 years ago

Yes, that is true; we use Starcode for preprocessing the reads, not for variant calling itself

The way you suggest is what we have ended up implementing in the workflow now

It would be cool though to have a sequence grouping mechanism that would be cognizant of different parts of sequences simultaneously