sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
473 stars 80 forks source link

support output of newick format from sourmash compare #915

Open ctb opened 4 years ago

ctb commented 4 years ago

see https://github.com/dib-lab/charcoal/commit/4838253a5677e3c64e50cd621308772116bbea81 for example code

ctb commented 2 years ago

cc @mr-eyes sounds like you have some code here? :)

mr-eyes commented 2 years ago

This snippet will convert the output of sourmash compare to newick format.

Step 1) construct the distance matrix

sourmash compare *sig --csv smash_compare_distmat.csv

Step 2) Perform complete-linkage hierarchical clustering

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import to_tree
import pandas as pd

# Thanks to https://stackoverflow.com/a/31878514/3371177
def get_newick(node, parent_dist, leaf_names, newick='') -> str:
    if node.is_leaf():
        return "%s:%.2f%s" % (leaf_names[node.id], parent_dist - node.dist, newick)
    else:
        if len(newick) > 0:
            newick = "):%.2f%s" % (parent_dist - node.dist, newick)
        else:
            newick = ");"
        newick = get_newick(node.get_left(), node.dist, leaf_names, newick=newick)
        newick = get_newick(node.get_right(), node.dist, leaf_names, newick=",%s" % (newick))
        newick = "(%s" % (newick)
        return newick

distMat_csv = "smash_compare_distmat.csv"
df = pd.read_csv(distMat_csv, sep = ',')
names = list(df.columns[0:])
distances = df[df.columns[:]].to_numpy()
Z = linkage(distances, 'complete')
tree = to_tree(Z, False)
newick = get_newick(tree, tree.dist, names)
with open(distMat_csv + ".newick", 'w') as NW:
    NW.write(newick)

We need to get rid of the dependencies that can't be added to sourmash. Also, adding multiple clustering options would be nice like Bray-Curtis, single-linkage, average-linkage, and others...

ctb commented 2 years ago

the only dependency there that's not already part of sourmash is pandas!

mr-eyes commented 2 years ago

No pandas :)

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import to_tree
import sys

# Thanks to https://stackoverflow.com/a/31878514/3371177
def get_newick(node, parent_dist, leaf_names, newick='') -> str:
    if node.is_leaf():
        return "%s:%.2f%s" % (leaf_names[node.id], parent_dist - node.dist, newick)
    else:
        if len(newick) > 0:
            newick = "):%.2f%s" % (parent_dist - node.dist, newick)
        else:
            newick = ");"
        newick = get_newick(node.get_left(), node.dist, leaf_names, newick=newick)
        newick = get_newick(node.get_right(), node.dist, leaf_names, newick=",%s" % (newick))
        newick = "(%s" % (newick)
        return newick

distMat_csv = "smash_compare_distmat.csv"
distances = list()
names = list()
with open(distMat_csv) as CSV:
    names = next(CSV).strip().split(',')
    for line in CSV:
        row = map(float, line.strip().split(','))
        distances.append(list(row))

Z = linkage(distances, 'complete')
tree = to_tree(Z, False)
newick = get_newick(tree, tree.dist, names)
with open(distMat_csv + ".newick", 'w') as NW:
    NW.write(newick)
mr-eyes commented 2 years ago

BTW, this should achieve similar results to kSpider.

mr-eyes commented 2 years ago

https://itol.embl.de/ is a super cool tool to visualize newick,

ctb commented 1 year ago

ref https://github.com/sourmash-bio/pyo3_branchwater/issues/111