simonhmartin / twisst

Topology weighting by iterative sampling of sub-trees
GNU General Public License v3.0
70 stars 18 forks source link

suggestion for plot.twisst in R with > 5 sp #18

Closed stubbsrl closed 2 years ago

stubbsrl commented 4 years ago

In R, plot.twisst works great with 5 or less species, but when I add an outgroup the image becomes unreadable due to the number of possible topologies. Would it be possible to set plot.twisst to only use a set number of the most heavily weighted topologies for visualizing? When looking at the boxplots of weights there is consistently <10 topologies that are making any sort of notable contribution.

My solution at the moment is clunky. I was able to subset the weights file by top weighted columns in R.

data <- read.table(original_weights_file, header=TRUE) best_topos <- colnames(data)[sort.list(colSums(data[,-1]), decreasing=TRUE)[1:10] + 1] new_weights <- data[best_topos] write.table(new_weights, "top10.weights.tsv", quote = FALSE)

Then manually add back in the top 10 topologies and read this new file in as the weights file. From this method, I can get a readable boxplot with topologies (albeit the numbers are now slightly skewed because I'm only using 10 topologies, not 105). But, this way is cumbersome and looks to cause problems with plot.twisst (topologies were mismatched). Is there another solution?

Thanks, Rebecca

simonhmartin commented 4 years ago

Hi Rebecca, Nice idea. How about a function called subset.twisst.by.topos, which returns the twisst object with only the topos you want? Simon

stubbsrl commented 4 years ago

Yes, that would be great! Sounds more efficient than what I was trying.

simonhmartin commented 4 years ago

Ok I've already done it. The latest version of plot_twisst.R should now include two subsetting functions that can be used as follows:

################## subset to only the most abundant topologies #################

#get list of the most abundant topologies (top 2 in this example)
top2_topos <- order(twisst_data$weights_overall_mean, decreasing=T)[1:2]

#subset twisst object for these
twisst_data_top2topos <- subset.twisst.by.topos(twisst_data, top2_topos)
#this can then be used in all the same plotting functions above.

###################### subset to only specific regions #########################

#regions to keep (more than one can be specified)
regions <- c("contig0")

#subset twisst object for these
twisst_data_contig0 <- subset.twisst.by.regions(twisst_data, regions)
#this can then be used in all the same plotting functions above.
simonhmartin commented 4 years ago

If you could let me know whether this works for you without errors that would be great.

stubbsrl commented 4 years ago

This is very cool! I subset my data as both best 2 and best 10 topologies. Both worked perfectly in plot.twisst.summary.boxplot, plot.weights, plot.twisst, and smooth.twisst.

The subset by region is also great - I had been going back to my VCF to pull out different scaffolds, create a new geno file, run twisst, repeat. So, this is another useful addition - thanks!

I was able to subset by just region and everything worked.

What did not work is if I tried both together, i.e., subset by region and then subset the top topologies. The output was the first 10 topologies (not the best 10), but the region subset still worked correctly. This is how I ran it:

twisst_data <- import.twisst(weights_file, data_file) regions <- c("pve_haplotypeT_001") twisst_data_contig1 <- subset.twisst.by.regions(twisst_data, regions) top_topos <- order(twisst_data_contig1$weights_overall_mean, decreasing=T)[1:10] twisst_data_toptopos <- subset.twisst.by.topos(twisst_data_contig1, top_topos)

I tried it the other way where I first subset topologies then subset regions. This also did not work as it gave me all topologies (but again the specified region was correct).

simonhmartin commented 4 years ago

Thanks for the exhaustive tests! I can't immediately figure out why using both together causes problems. I will give it some more thought when I have a chance and get back to you.