Rob-murphys closed this issue 3 years ago
If you're building something like a minimum spanning tree (MST) for your network, then I think these distances are suitable. Mashtree does not produce a phylogeny; it only clusters. In that sense, the output is actually well suited to a network analysis.
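To make the MST idea concrete, here is a minimal Python sketch using Kruskal's algorithm on a symmetric distance matrix. It is not part of Mashtree; the function name `minimum_spanning_tree` and the four-genome toy matrix are made up for illustration, and a real run would load the pairwise table instead.

```python
def minimum_spanning_tree(labels, dist):
    """Kruskal's algorithm on a symmetric distance matrix (list of lists)."""
    parent = list(range(len(labels)))

    def find(i):
        # Follow parent pointers to the component root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit every unordered pair once, shortest distance first.
    edges = sorted(
        (dist[i][j], i, j)
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    )
    tree = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two separate components: keep it
            parent[root_i] = root_j
            tree.append((labels[i], labels[j], d))
    return tree


# Hypothetical toy matrix: A/B are close, C/D are close, the pairs are far apart.
labels = ["A", "B", "C", "D"]
dist = [
    [0.00, 0.01, 0.20, 0.22],
    [0.01, 0.00, 0.21, 0.23],
    [0.20, 0.21, 0.00, 0.02],
    [0.22, 0.23, 0.02, 0.00],
]
for a, b, d in minimum_spanning_tree(labels, dist):
    print(f"{a}\t{b}\t{d}")
```

The result always has one edge fewer than the number of genomes, so no cutoff is needed to draw it; long edges in the tree are the candidates for boundaries between groups.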
Cutoffs are difficult to determine, and I believe they will change with each species or subspecies. I would recommend plotting a histogram of these distances so that you can run outlier detection on them. One outlier detection technique is to flag all distances more than 3 standard deviations from the mean. Hopefully it will be painfully obvious where your three standard deviations fall. E.g.,
wget https://pastebin.com/raw/tyEHnM6Y -O distances.tsv --no-check-certificate
tail -n +2 distances.tsv | perl -MStatistics::Descriptive -lane 'shift @F; push(@dist,@F); last; END{$stats = Statistics::Descriptive::Full->new(); $stats->add_data(@dist); print $stats->mean." ± ". $stats->standard_deviation; }'
# 0.165890760869565 ± 0.0611095302418224
Or, maybe more to the point:
tail -n +2 distances.tsv | perl -MStatistics::Descriptive -lane 'shift @F; push(@dist,@F); last; END{$stats = Statistics::Descriptive::Full->new(); $stats->add_data(@dist); $mean = $stats->mean; $stdev = $stats->standard_deviation; $low=$mean-3*$stdev; $high=$mean+3*$stdev; printf("Outliers are outside of %0.4f - %0.4f\n", $low, $high); }'
# Outliers are outside of -0.0174 - 0.3492
As always with statistics, look at the actual distribution on a graph to check whether it even looks normal or whether there is anything funky about it.
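For anyone who prefers Python, the same three-standard-deviation bounds can be sketched with only the standard library. Note that, as written, the `last` in the one-liners above ends the loop after the first data row, so those figures summarize one genome's distances; this sketch instead uses every unordered pair once. The helper names are hypothetical, and the parser assumes a tab-separated square matrix with a header row and a leading label column, which is what the `tail -n +2` and `shift @F` suggest.

```python
import statistics


def pairwise_distances(lines):
    """Return each unordered pair's distance once from a square TSV matrix.

    Assumes a header row and a leading label column; adjust if your
    table is laid out differently.
    """
    rows = [
        line.rstrip("\n").split("\t")[1:]
        for line in lines[1:]
        if line.strip()
    ]
    return [
        float(rows[i][j])
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    ]


def three_sigma_bounds(values):
    """Mean minus/plus three sample standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return mean - 3 * stdev, mean + 3 * stdev


# Hypothetical three-genome table standing in for distances.tsv.
toy = [
    ".\tA\tB\tC",
    "A\t0\t0.1\t0.2",
    "B\t0.1\t0\t0.3",
    "C\t0.2\t0.3\t0",
]
low, high = three_sigma_bounds(pairwise_distances(toy))
print(f"Outliers are outside of {low:0.4f} - {high:0.4f}")
```

To run it on the real table, pass `open("distances.tsv").readlines()` to `pairwise_distances`. The self-distances on the diagonal are skipped so that the zeros do not pull the mean down.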
When plotting the distances as a histogram, I seem to get a fairly bimodal distribution, which is a bit odd?
Wouldn't outlier detection remove potentially true, closely related vertices from the network?
I will look into minimum spanning trees; this seems like exactly what I want!
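On the bimodal histogram: one possible reading (an assumption, not something confirmed in this thread) is that one peak holds within-group distances and the other between-group distances. If so, a cutoff in the valley between the peaks may be more useful than a three-sigma rule. Below is a minimal Python sketch of Otsu-style thresholding on one-dimensional values, which is a different technique from the outlier detection suggested above; `otsu_cutoff` and the toy values are hypothetical.

```python
def otsu_cutoff(values):
    """Pick the split of sorted values that maximizes between-class variance.

    Returns the midpoint between the two values on either side of the
    best split, or None if all values are identical.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_score, best_cut = -1.0, None
    left_sum = 0.0
    for k in range(1, n):  # left group = xs[:k], right group = xs[k:]
        left_sum += xs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # cannot place a cut between equal values
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (n - k)
        # Proportional to the between-class variance w0 * w1 * (mu0 - mu1)^2.
        score = k * (n - k) * (mean_left - mean_right) ** 2
        if score > best_score:
            best_score = score
            best_cut = (xs[k - 1] + xs[k]) / 2
    return best_cut


# Hypothetical distances with two clear modes.
toy = [0.01, 0.02, 0.015, 0.20, 0.22, 0.21, 0.23]
print(otsu_cutoff(toy))
```

Whatever the method suggests, it is worth checking the cutoff against the histogram by eye, since a split is returned even when the data are not really bimodal.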
I think you're sufficiently on your way on this topic, but it's outside the scope of this software.
I have generated pairwise min-hash values between 44 draft fungal genomes with your tool and want to plot them as a network to display putative species, then correlate this with other methods to see if they all agree.
When generating the network, I need to pick a cutoff for the edge values to show, but I am unsure how to interpret the min-hash values as a measure of relatedness. It is my understanding that higher means more distant, but what is a reasonable cutoff value, or how can I begin to determine one?
Here is the table of pairwise min-hash values: https://pastebin.com/tyEHnM6Y
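To illustrate what an edge cutoff does to such a network, here is a minimal Python sketch that keeps only edges at or below a cutoff and reports the connected components as putative groups. This amounts to single-linkage grouping; the function name, the toy matrix, and both cutoff values are hypothetical and say nothing about what threshold suits fungal genomes.

```python
def clusters_at_cutoff(labels, dist, cutoff):
    """Group genomes linked by any chain of edges with distance <= cutoff."""
    n = len(labels)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # depth-first walk of one connected component
            i = stack.pop()
            members.append(labels[i])
            for j in range(n):
                if j not in seen and dist[i][j] <= cutoff:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Hypothetical toy matrix, not taken from the linked table.
labels = ["A", "B", "C", "D"]
dist = [
    [0.00, 0.01, 0.20, 0.22],
    [0.01, 0.00, 0.21, 0.23],
    [0.20, 0.21, 0.00, 0.02],
    [0.22, 0.23, 0.02, 0.00],
]
for cutoff in (0.05, 0.20):
    print(cutoff, clusters_at_cutoff(labels, dist, cutoff))
```

Running this over a range of cutoffs and watching where the number of groups stays stable is one way to see how sensitive the putative species are to the threshold.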