gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0
90 stars 21 forks source link

allpairs output not congruent with sphere clustering output? #13

Open claudiuskerth opened 8 years ago

claudiuskerth commented 8 years ago

Hi Eduard,

I have run the latest starcode version from the allpairs branch on an input file of distinct sequences (same as referenced in "cluster size" bug report) with the command line:

starcode -d8 -t22 -i sort_uniqed_reads.txt -o starcode_edges.out

I count 96,071 distinct sequences (nodes) in the output file. There are 134,492 distinct sequences in the input file. So I infer from that 38,421 (134492 - 96,071) sequences are singletons, i. e. have no tau-match with any other sequence in the input.

To check that, I looked into the output from a run with the newest (i. e. cluster size corrected) version of starcode from the master branch created with the following command:

starcode -d8 -t22 -s --print-clusters -i sort_uniqed_reads.txt -o starcode.out

I assume that with sphere clustering, any sequence that does not have a tau-match will be put in a cluster on its own. Counting the number of "clusters" with only one sequence I get 41,908. Shouldn't this rather be 38,421?

many thanks for your help,

claudius

ezorita commented 8 years ago

Hi Claudius,

well, this is not necessarily true. Spheres clustering works as follows:

  1. Select the node with greater count (the cluster centroid).
  2. Visit its matching nodes and claim their count.
  3. Set count = 0 at the matching nodes.
  4. Repeat 1 until all nodes with count > 0 have been processed.

So now imagine that we have the following network after all pairs search:

A - B - C

i.e. A matches B, B matches C, but d(A,C) > d. If A,B and C have the same number of counts, step 1 will select a sequence at random. Therefore, if either A or C are selected first, they will claim B and the other will be left alone (thus generating a singleton). Otherwise if B is selected first, the three sequences will cluster together.

The point that you raised is interesting because starcode is not optimizing the centroids in spheres clustering. The optimal centroid choice would be the one that minimizes the final number of clusters and this can be achieved by modifying step 1:

  1. Select the node that generates a greater cluster.

Precomputing this is not straightforward given the current implementation of starcode but I will work on it during the following days.

Note, however, that even if the centroid choice is optimal the A,B,C example still applies and the number of singletons won't necessarily coincide with the result of all pairs search.

Thank you for contributing!

Eduard

claudiuskerth commented 8 years ago

Hi Eduard,

thank you for the clarification!

I know this is not the intended use case for starcode, but I try to cluster sequences that come from the same locus while avoiding as much as possible clustering sequences from different but similar (paralogous) loci. I don't know which graph clustering algorithm does best for this problem, but in any case I would also be interested in a raw clusters output from starcode, i. e. before applying any clustering algorithm that severs edges between nodes.

cheers

claudius

ezorita commented 8 years ago

Yes, I can make a raw version that prints all the edges found for each node. In the A,B,C example it would be something like: B A B C A B C B is this what you mean?

Eduard

claudiuskerth commented 8 years ago

Hi Eduard,

that's not quite what I mean. I am thinking of a "clustering" output with a canonical sequence, cluster size and list of child nodes (and grandchild nodes and great grandchild nodes ...), but from the raw clusters of the all-pairs search. For instance, if there was a path from every input sequence to any other input sequence, that would output just one big cluster. So, in principle, the cluster radius could be much bigger than tau if there were intermediate nodes between the canonical sequence and a sequence that is more than tau edits apart.

claudius

ezorita commented 8 years ago

I see, this feature should be easy to add. I am busy analyzing experimental data right now, but I will get back to this asap.

Eduard

ezorita commented 8 years ago

Please pull the update in master branch.

Eduard

claudiuskerth commented 8 years ago

Hi Eduard,

thank you for adding the --connected-comp output to starcode.

I am just wondering how the canonical sequence is selected for this output, i. e. random or sequence with highest coverage or even some kind of majority consensus?

claudius

ezorita commented 7 years ago

Sorry Claudius, I missed this one.

Yes, the canonical sequence with --connected-comp is the one in the cluster with more counts. In case of count tie, the sequence with greater number of edges is the canonical. If two or more sequences have maximum count and the same number of edges, one of them is returned at random.

Eduard

claudiuskerth commented 7 years ago

Thank you Eduard!