lskatz / mashtree

:deciduous_tree: Create a tree using Mash distances
GNU General Public License v3.0

How to interpret min-hash values in phylogeny #62

Closed Rob-murphys closed 3 years ago

Rob-murphys commented 3 years ago

I have generated pairwise min-hash values between 44 draft fungal genomes with your tool and want to plot them as a network to display putative species, then correlate this with other methods to see if they all agree.

When generating the network I need to pick a cutoff for which edge values to show, but I am unsure how to interpret the min-hash values as a measure of relatedness. It is my understanding that higher means more distant, but what is a reasonable cutoff value, or how can I begin to determine one?
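The thresholding step described here can be sketched as follows: keep only edges at or below a cutoff and report the connected components as putative groups. This is a minimal sketch, not part of mashtree; the function name `clusters_at_cutoff` and the toy matrix are made up for illustration and are not taken from the pastebin table.

```python
def clusters_at_cutoff(names, dist, cutoff):
    """Group genomes linked by at least one chain of edges with distance <= cutoff."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair whose distance passes the cutoff.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())


# Made-up toy distances (hypothetical, NOT from the real table).
names = ["A", "B", "C", "D"]
dist = [
    [0.00, 0.02, 0.20, 0.25],
    [0.02, 0.00, 0.18, 0.22],
    [0.20, 0.18, 0.00, 0.03],
    [0.25, 0.22, 0.03, 0.00],
]

# Scan a few cutoffs to see how the grouping changes.
for cutoff in (0.01, 0.05, 0.19):
    print(cutoff, clusters_at_cutoff(names, dist, cutoff))
```

Scanning several cutoffs like this shows where the grouping is stable and where it collapses into one component, which may help when comparing against other species-delimitation methods.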

Here is the table of pairwise min-hash values https://pastebin.com/tyEHnM6Y

lskatz commented 3 years ago

If you're building something like a minimum spanning tree (MST) for your network, then I think these distances are suitable. Mashtree does not produce a phylogeny; it only clusters genomes by their Mash distances. In that sense, the output is actually well suited to a network analysis.
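For reference, an MST over a pairwise distance matrix can be sketched with Prim's algorithm. This is an illustrative sketch under the assumption of a square, symmetric matrix; the function name `minimum_spanning_tree` and the toy values are hypothetical and are not mashtree output.

```python
def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (name_a, name_b, distance) edges.
    """
    n = len(names)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known distance from each node to the tree
    parent = [-1] * n          # tree node that offers that cheapest distance
    best[0] = 0.0              # start growing the tree from the first genome
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((names[parent[u]], names[u], dist[parent[u]][u]))
        # Update the cheapest connection for every node still outside the tree.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Made-up toy distances (hypothetical, NOT from the real table).
names = ["A", "B", "C", "D"]
dist = [
    [0.00, 0.02, 0.20, 0.25],
    [0.02, 0.00, 0.18, 0.22],
    [0.20, 0.18, 0.00, 0.03],
    [0.25, 0.22, 0.03, 0.00],
]

mst_edges = minimum_spanning_tree(names, dist)
for a, b, d in mst_edges:
    print(f"{a}\t{b}\t{d:.2f}")
```

An MST always keeps n - 1 edges, so every genome stays connected without choosing a cutoff; unusually long edges in the tree are candidates for boundaries between groups.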

Cutoffs are difficult to determine, and I believe they will change with each species or subspecies. I would recommend plotting a histogram of these distances so that you can run outlier detection on them. One outlier detection technique is to flag all distances more than 3 standard deviations from the mean. Hopefully it will be painfully obvious where your three standard deviations fall. E.g.,

wget https://pastebin.com/raw/tyEHnM6Y -O distances.tsv --no-check-certificate
tail -n +2 distances.tsv | perl -MStatistics::Descriptive -lane 'shift @F; push(@dist,@F); last; END{$stats = Statistics::Descriptive::Full->new(); $stats->add_data(@dist); print $stats->mean." ± ". $stats->standard_deviation; }'
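# 0.165890760869565 ± 0.0611095302418224
# (note: `last` exits the loop after the first data row, so these statistics describe that row only)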

Or, maybe more to the point:

tail -n +2 distances.tsv | perl -MStatistics::Descriptive -lane 'shift @F; push(@dist,@F); last; END{$stats = Statistics::Descriptive::Full->new(); $stats->add_data(@dist); $mean = $stats->mean; $stdev = $stats->standard_deviation; $low=$mean-3*$stdev; $high=$mean+3*$stdev; printf("Outliers are outside of %0.4f - %0.4f\n", $low, $high); }'
# Outliers are outside of -0.0174 - 0.3492

As always with statistics, take a look at the actual distribution on a graph to check whether it even looks normal or whether there is anything funky about it.
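The one-liners above stop after the first data row because of `last`. A minimal sketch of the same mean ± 3 standard deviations check over every unordered pair, plus a quick text histogram, might look like the following. The assumed layout (one header row, then one row per genome with its label in the first column) and the helper names `read_pairwise` and `text_histogram` are assumptions for illustration; the inline table is made-up toy data.

```python
import io
import statistics
from collections import Counter


def read_pairwise(handle):
    """Return each unordered pair's distance once from a square TSV matrix.

    Assumed layout: a header row, then rows of label<TAB>d1<TAB>d2...
    """
    next(handle)  # skip the header row
    rows = [line.rstrip("\n").split("\t")[1:] for line in handle if line.strip()]
    # Upper triangle only: skips self-distances and duplicate (j, i) entries.
    return [float(rows[i][j]) for i in range(len(rows)) for j in range(i + 1, len(rows))]


def text_histogram(values, bin_width=0.05):
    """Print a crude histogram, one row of '#' marks per bin."""
    counts = Counter(int(v / bin_width + 1e-9) for v in values)
    for b in range(min(counts), max(counts) + 1):
        print(f"{b * bin_width:.2f}-{(b + 1) * bin_width:.2f} {'#' * counts[b]}")


# Made-up toy table (hypothetical, NOT the real distances.tsv).
toy = (
    ".\tA\tB\tC\tD\n"
    "A\t0\t0.02\t0.20\t0.25\n"
    "B\t0.02\t0\t0.18\t0.22\n"
    "C\t0.20\t0.18\t0\t0.03\n"
    "D\t0.25\t0.22\t0.03\t0\n"
)

pairs = read_pairwise(io.StringIO(toy))
mean = statistics.mean(pairs)
stdev = statistics.stdev(pairs)  # sample standard deviation (n - 1)
low, high = mean - 3 * stdev, mean + 3 * stdev
print(f"{mean:.4f} ± {stdev:.4f}")
print(f"Outliers are outside of {low:.4f} - {high:.4f}")
text_histogram(pairs)
```

To try it on the downloaded table, replace the `io.StringIO(toy)` argument with a file handle, e.g. `open("distances.tsv")`, assuming the file follows the layout described above.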

Rob-murphys commented 3 years ago

When plotting the distances as a histogram, I seem to get a fairly bimodal distribution, which is a bit odd:

[image: histogram of the pairwise distances]

Would doing outlier detection not remove any potential true closely related vertices from the network?

I will look into minimum spanning trees; this seems like exactly what I want!

lskatz commented 3 years ago

I think you're sufficiently on your way on this topic, but it's outside the scope of this software.