Open · jlblancopastor opened this issue 7 years ago

Hi, it is not indicated in the example file how to perform permutations to create new datasets for the stopping-criterion test regarding the optimal number of clusters.
I would also be interested in an example!
I'm attempting to wrap treeCl for use in OrthoLang. I got the dependencies worked out yesterday and it should be close to running, but it needs an example script for testing.
Am I right in thinking that the overall workflow would be to repeat the whole example.py for increasing numbers of clusters (1, 2, 3, 4, ...), do the same thing with a randomly permuted input collection (like the code shown in this issue), and then pick the highest number of clusters where the scorer still returns a significantly higher score for the actual data than for the shuffled data?
If that was done manually for the paper but is roughly the right algorithm to automate, I'm actually interested in trying to automate it, mainly because it sounds like a great example of using the "permute, repeat, summarize" pattern built into OrthoLang:

... for something other than picking e-value cutoffs.

EDIT: In this case the variable being permuted would be n_clusters, the variable being summarized would be a score, something like score = delta_k_actual - delta_k_sim, and the repeat_each function would be a slightly more complicated version, like increment_while_score_changes.
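To spell that out (and to check my own understanding), here is a minimal Python sketch of the loop I have in mind. Everything in it is hypothetical: score_fn stands in for whatever example.py does to produce a score for a given number of clusters, permute_fn stands in for making a shuffled copy of the input, and delta_k is just shorthand for the score change from k-1 to k clusters. None of these are treeCl or OrthoLang functions, and the "significance" test is deliberately crude.

```python
def delta_k(score_fn, collection, k):
    """Score gain from moving from k - 1 to k clusters (hypothetical helper)."""
    return score_fn(collection, k) - score_fn(collection, k - 1)


def pick_n_clusters(collection, score_fn, permute_fn, max_k=10, n_permutations=20):
    """Keep adding clusters while the gain on the real data still beats the
    gain seen on randomly permuted copies of the same data."""
    best_k = 1
    for k in range(2, max_k + 1):
        actual_gain = delta_k(score_fn, collection, k)
        # The same gain measured on shuffled data gives an empirical null.
        null_gains = [delta_k(score_fn, permute_fn(collection), k)
                      for _ in range(n_permutations)]
        # Crude cutoff: stop as soon as the real gain is no longer clearly
        # above all of the permuted gains (a proper p-value would also work).
        if actual_gain <= max(null_gains):
            break
        best_k = k
    return best_k
```

That comparison is roughly the score = delta_k_actual - delta_k_sim idea above, just with a max over the permuted replicates standing in for a real significance test.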
Of course, if it's already automated, that would be better, and I would include your code rather than duplicating the effort. If you have a quick, ugly script lying around from generating the figure 3 data, I would start from that.
Hi Jeff,
Am I right in thinking that the overall workflow would be to repeat the whole example.py for increasing numbers of clusters (1, 2, 3, 4, ...), do the same thing with a randomly permuted input collection (like the code shown in this issue), and then pick the highest number of clusters where the scorer still returns a significantly higher score for the actual data than for the shuffled data?
You're pretty much spot on here. Once you load a collection of alignments into treeCl you can make permuted copies using the treeCl.Collection.permuted_copy() method. These each become new, independent starting points for doing the whole analysis, and you use the resulting scores as an empirical null distribution for deciding whether your clustering of the original data is significant or not. And you would run this process sequentially for increasing numbers of clusters until there is no significant improvement gained by adding another cluster. It's quite compute-intensive, but the mechanics of the process seem to fit well with the way you do things in OrthoLang.
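To connect that to the sketch above, here is a hedged example of how the permuted copies might be produced. The input_dir and file_format arguments are assumptions about how the alignments are laid out (they mirror the Collection constructor shown in treeCl's own examples); permuted_copy() is the method mentioned above, and the rest of the example.py pipeline (tree estimation, clustering, scoring) would be run unchanged on each copy.

```python
import treeCl

# Load the original alignments. The constructor arguments here are an
# assumption about the input layout; adjust to match example.py.
collection = treeCl.Collection(input_dir='alignments', file_format='phylip')

# Each call returns an independent, randomly permuted copy of the collection.
# Running the full example.py analysis on each one, at a given number of
# clusters, yields the empirical null distribution of scores described above.
null_starting_points = [collection.permuted_copy() for _ in range(20)]
```

These permuted collections are what would get plugged in as permute_fn in the earlier sketch, with the scores from the original data compared against the scores from the permuted ones at each number of clusters.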
I've not seen OrthoLang before; it looks interesting. Let me know if there's anything useful I can do to help if you decide to support treeCl in OrthoLang.
Kevin