Open ctb opened 4 years ago
cc @mr-eyes sounds like you have some code here? :)
This snippet will convert the output of sourmash compare
to newick format.
Step 1) construct the distance matrix
sourmash compare *sig --csv smash_compare_distmat.csv
Step 2) Perform complete-linkage hierarchical clustering
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import to_tree
import pandas as pd
# Thanks to https://stackoverflow.com/a/31878514/3371177
def get_newick(node, parent_dist, leaf_names, newick='') -> str:
if node.is_leaf():
return "%s:%.2f%s" % (leaf_names[node.id], parent_dist - node.dist, newick)
else:
if len(newick) > 0:
newick = "):%.2f%s" % (parent_dist - node.dist, newick)
else:
newick = ");"
newick = get_newick(node.get_left(), node.dist, leaf_names, newick=newick)
newick = get_newick(node.get_right(), node.dist, leaf_names, newick=",%s" % (newick))
newick = "(%s" % (newick)
return newick
distMat_csv = "smash_compare_distmat.csv"
df = pd.read_csv(distMat_csv, sep = ',')
names = list(df.columns[0:])
distances = df[df.columns[:]].to_numpy()
Z = linkage(distances, 'complete')
tree = to_tree(Z, False)
newick = get_newick(tree, tree.dist, names)
with open(distMat_csv + ".newick", 'w') as NW:
NW.write(newick)
We need to get rid of the dependencies that can't be added to sourmash. Also, adding multiple clustering options would be nice like Bray-Curtis, single-linkage, average-linkage, and others...
the only dependency there that's not already part of sourmash is pandas!
No pandas :)
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import to_tree
import sys
# Thanks to https://stackoverflow.com/a/31878514/3371177
def get_newick(node, parent_dist, leaf_names, newick='') -> str:
if node.is_leaf():
return "%s:%.2f%s" % (leaf_names[node.id], parent_dist - node.dist, newick)
else:
if len(newick) > 0:
newick = "):%.2f%s" % (parent_dist - node.dist, newick)
else:
newick = ");"
newick = get_newick(node.get_left(), node.dist, leaf_names, newick=newick)
newick = get_newick(node.get_right(), node.dist, leaf_names, newick=",%s" % (newick))
newick = "(%s" % (newick)
return newick
distMat_csv = "smash_compare_distmat.csv"
distances = list()
names = list()
with open(distMat_csv) as CSV:
names = next(CSV).strip().split(',')
for line in CSV:
row = map(float, line.strip().split(','))
distances.append(list(row))
Z = linkage(distances, 'complete')
tree = to_tree(Z, False)
newick = get_newick(tree, tree.dist, names)
with open(distMat_csv + ".newick", 'w') as NW:
NW.write(newick)
BTW, this should achieve similar results to kSpider.
https://itol.embl.de/ is a super cool tool to visualize newick,
see https://github.com/dib-lab/charcoal/commit/4838253a5677e3c64e50cd621308772116bbea81 for example code