Closed lucaz88 closed 1 year ago
Hey @lucaz88,
The clustering of genomes in the pangenomic workflow takes place here in panops.py
:
(...)
for clustering_tuple in [('gene_cluster presence absence', self.view_data),
('gene_cluster frequencies', self.view_data_presence_absence)]:
v, d = clustering_tuple
newick = clustering.get_newick_tree_data_for_dict(d,
transpose=True,
distance = self.distance,
linkage=self.linkage)
(...)
If you follow the function get_newick_tree_data_for_dict
in panops.py
you will find that the default clustering uses 'euclidean' distance with 'ward' linkage on l1-normalized vectors by default, where linkage is calculated like this:
linkage = hierarchy.linkage(vectors, metric=distance, method=linkage)
Your first dendrogram is almost identical to the one in anvi'o (if you rotate things around a bit), and differences are likely due to the normalization step. I hope this helps.
Best wishes, Meren.
Thanks @meren !! I missed the normalization step! Now it looks identical without the need to do rotate anything 😉
Perfect :) Thank you for reporting back!
Hi,
I am trying to replicate the genome dendrogram that I obtained in anvio when selecting "gene_clusters (presence absence (tree)" using R. The goal is to test the stability of the branching e.g. by performing a bootstrap and identify a robust number of clusters. How is that tree/dendrogram actually built?
As input data in R, I am using the summary table generated with the command
anvi-summarize -g 4_pangenome/mygnm-GENOMES.db -p 4_pangenome/mcl_10_minocc_1/mygnm-PAN.db -C default -o 5_summary
which I attached to this thread (gene annotations removed) and I parse it into a genome X gene_cluster matrix with the following code:The tree from anvio looks like this:
however the closest that I can get to it is by using euclidean distance and ward.D2 agglomeration computed on the frequency matrix (genomes clustering is nearly identical, only Sp18 is in a different position)
plot(hclust(vegan::vegdist(gnm_genes_mat, method = "eucl"), method = "average"))
while if I use the presence/absence matrix, the tree looks quite different
Hope the question has not been already answered elsewhere but I couldn't find it. I am using Ubuntu 22.04.1 LTS and the anvio-dev version