Closed vineeth-s closed 2 years ago
Thank you for your comment. Could you please craft a real example that we can run so that we can 1) make sure that we understand the problem and 2) start to discuss a solution?
Thanks for the reply, here is a short set of fastq reads
@M01579:268:000000000-K3Y7K:1:1101:9391:1930 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1931 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1932 1:N:0:122
TTACGTTCACCGTTAGAGTCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCGCTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1933 1:N:0:122
CGTTAGAGTTTACGTTCACCACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCACTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
@M01579:268:000000000-K3Y7K:1:1101:9391:1934 1:N:0:122
CGTTAGAGTCTTACGTTCACACTTCGTACGTACGGCGTCTTATACTTTGTTTGTTTAACTTTGTGTCACTACCTCAGTTTGCCCCCATGTCCCTTACACACACGCAAAATACTCCTTCAGCGGAGCGAAGAGGTGGCGGATGACTGGCACGCTCCATGACCGGCCCAGCAGTCTCTGCCTCGCCAAGCGGCTCATGTTGGAGACGTCAGTATAGTGGACTGGGAAACCAAATACCCTGGGGGAGAAAAGGCAGAGAGGGCAGGGTGAGTGCAGGCCAGACCAGGCTGCCCGGAAGCCGTCTAACCACACAGCAGGACCCGGAGGACCAGCAGCCACCCGAAGTGCGGGGACAGGGGCACTCACACCCACCAACTCCTCTGACGCCTGCTTA
+
CCCCCGGGGGGFGGGGCGGGGGGGFEGGFGGGGGGEEGGGGAFAEEFGFGGGCEEF9EEGGGGGGGGGFG7BFF@DCECEAECA<FCFFGGFFGGGGIIIGIGIGIIDI<IEFIIIGB3;?I>==II?IIBIGEII=IIDFI@IIBIEIIIIIIIIIIIIIIIEIIIIII3>I;:HFIIGIIIII>IIIII8D>I=IFIIIF<;IIEIIIIIIIIII>=IGFI@II=I=D5IIDCFIFIIIIIIIIIIIIIIFCIIIIIIIIIIIIIIDIE,AF<AICAIFFIBDFIIIIIIIII+4B,8FB:,,FEA8GGECFCGGGGGGGGGFE,C@@FC++9+@DGGF:@:GGGGGGFCD<<FEC6,7@,@+,8F<CFFEF@C<F7GGCED@C?BA9C
In the first 3 reads, is the stretch
CGCTACCTCAGTT
In the last 2 reads, this stretch is
CACTACCTCAGTT
The UMI is the first 20 bp of each read, starcode-umi correctly pulls them apart as the first 3 reads have the same UMI, and the last 2 the same, but the stretch above is called as
CGCTACCTCAGTT
in both consensus sequences
Thanks! Can you include the Starcode command you use to cluster the reads?
Here we go ...
starcode-umi --umi-len 20 --umi-d 4 --seq-d 8 --umi-cluster s --seq-cluster s --seq-id --umi-threads 20 --seq-threads 20 --starcode-path starcode --seq-trim 0 test.fastq > clones.txt
Thanks for providing the details. It seems that you have uncovered a deep bug in the spheres clustering algorithm. I should be able to fix it very soon.
Actually, this is not a bug; this is the expected behavior of the code. starcode-umi
clusters the UMIs and the sequences independently, and then matches them. In your example, the sequences after the UMIs are within Levenshtein distance 1 so they are collapsed into a single canonical sequence in the first step of the algorithm.
Most of the time, this is what the user wants. If the sequences are very close (especially considering that they are 371 nucleotides long) it is likely that they actually correspond to the same biological sequence. However, this is not always the case and that depends on the problem and on the experiment.
Why do you think that in your case the output is invalid?
Hi Guillame, we are interested in SNPs — and this information gets lost
I don't think that Starcode is the right tool to study SNPs. However you put it, the minor allele will be clustered with the major one... that's what the algorithm is for.
If you want to cluster the UMIs but not the reads, then the best is to run Starcode on the UMIs with the option --tidy
. This will give you a "new" UMI for every read that you can use for grouping. Maybe it would work even better with --seq-id
that would give you the index of all the reads that you should group together.
Yes, that is true; we use Starcode for preprocessing the reads, not for variant calling itself
The way you suggest is what we have ended up implementing in the workflow now
It would be cool though to have a sequence grouping mechanism that would be cognizant of different parts of sequences simultaneously
Let us consider the following situation, where
u1,u2,u3,u4,u5 are UMI sequences, u1,u2,u3 group together as the same UMI, and u4,u5 form the other UMI
the sequence following the : is the "amplicon"
Now starcode groups the UMIs, and then clusters the amplicons to get the canonical
AAAGTCAA
The way starcode-umi operates, u1,u2,u3 are grouped together and gets the right canonical sequence, while u4,u5 are grouped together correctly but now gets assigned the wrong canonical sequenceAAAGTCAA
when it should beAAATTCAA