clustering unique input sequences produces unexpected output

rpadmanabhan commented 6 years ago

Hi, I ran starcode on a file with 1 unique sequence per line with the following command

/srv/qgen/code/starcode/starcode -t 4 -r 1 -d 8 --print-clusters -i primers.txt -o primers.clustered.txt

The 2nd column of the output (cluster size) contains numbers greater than the number of sequences listed in the 3rd column (the sequences in the cluster). This does not seem to make sense, as I have only 1 unique sequence on each line. Am I specifying an incorrect command line parameter?

I have attached the input and output files: primers.txt and primers.clustered.txt

gui11aume commented 6 years ago

Thank you for submitting this issue. I could reproduce the bug, which apparently is due to the combination -r1 -d8. We will look into this as soon as possible.

gui11aume commented 6 years ago

On closer inspection, this is not a bug, but a poorly known feature of Starcode.

Observe that the sum of the numbers in the output is equal to the number of sequences in the input. Depending on the clustering conditions, some sequences can be merged into two different clusters. These ambiguities are particularly annoying, but there is no way to get rid of them in the message passing clustering algorithm.

We have opted not to show them in the output, but this in turn has created some inconsistencies with the sequence counts (which are always shown in the output). Please accept my apologies for the time you wasted on this issue.
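To see the inconsistency in a given output file, a short script along these lines may help. It is only a sketch: it assumes the `--print-clusters` output has three tab-separated columns (centroid, cluster size, comma-separated sequences), the function name `summarize_clusters` is made up for illustration, and the two example rows are invented data, not taken from the attached files.

```python
import io

def summarize_clusters(lines):
    """Parse lines of starcode --print-clusters output.

    Assumed layout (tab-separated): centroid, cluster size,
    comma-separated list of the sequences shown for the cluster.
    Returns the sum of all cluster sizes and one row per cluster:
    (centroid, reported size, listed sequences, difference).
    """
    total = 0
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        centroid, count, members = line.split("\t")
        count = int(count)
        listed = len(members.split(","))
        rows.append((centroid, count, listed, count - listed))
        total += count
    return total, rows

# Invented example rows, for illustration only.
example = io.StringIO(
    "ACGTACGT\t3\tACGTACGT,ACGTACGA\n"
    "TTTTGGGG\t1\tTTTTGGGG\n"
)

total, rows = summarize_clusters(example)
for centroid, count, listed, diff in rows:
    print(centroid, count, listed, diff)
print("sum of cluster sizes:", total)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` object. When every input sequence is unique, the sum of cluster sizes should match the number of input lines, and a non-zero difference on a row marks a cluster whose size includes sequences that are not listed.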

Sequences that cannot be unambiguously clustered are a recurrent issue, and we haven't yet found an elegant way to deal with them. If this information is not enough to solve your issue, please let me know what would be needed. Also, if you have suggestions on how to gracefully deal with this case, I would be happy to hear them.

rpadmanabhan commented 6 years ago

Thank you for the quick response. Ah, I see, that does make sense to me. Since I am only concerned with sequences that have been clustered, I can disregard the ambiguously clustered sequences and the counts that include them. I will try to contribute any suggestions on how to deal with this.

gui11aume commented 6 years ago

It seems to me that you are rather interested in the 'all-pairs' problem in which you would get the list of all pairs of primers matching at distance 8. Let me know if this is the case, I may have something for you...
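For a small primer list, the all-pairs problem can be sketched with a brute-force comparison. This is not how Starcode does it, and it is not the tool hinted at above; it is a plain quadratic loop, assuming the distance of interest is the Levenshtein (edit) distance. The function names and the three example sequences are made up for illustration.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def all_pairs_within(seqs, max_dist):
    """Return (seq_a, seq_b, distance) for every pair within max_dist."""
    pairs = []
    for a, b in combinations(seqs, 2):
        # The length difference is a lower bound on the edit distance.
        if abs(len(a) - len(b)) > max_dist:
            continue
        d = levenshtein(a, b)
        if d <= max_dist:
            pairs.append((a, b, d))
    return pairs

# Invented sequences, for illustration only.
primers = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
pairs = all_pairs_within(primers, 2)
print(pairs)
```

The cost grows with the square of the number of sequences, so this is only reasonable for lists of a few thousand primers at most. For the case in this thread, `max_dist` would be set to 8.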

gui11aume / starcode

clustering unique input sequences produces unexpected output #19