merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

Replicate genome dendrogram #2091

Closed lucaz88 closed 1 year ago

lucaz88 commented 1 year ago

Hi,

I am trying to replicate the genome dendrogram that I obtained in anvio when selecting "gene_clusters (presence absence (tree)" using R. The goal is to test the stability of the branching e.g. by performing a bootstrap and identify a robust number of clusters. How is that tree/dendrogram actually built?

As input data in R, I am using the summary table generated with the command anvi-summarize -g 4_pangenome/mygnm-GENOMES.db -p 4_pangenome/mcl_10_minocc_1/mygnm-PAN.db -C default -o 5_summary which I attached to this thread (gene annotations removed) and I parse it into a genome X gene_cluster matrix with the following code:

gnm_genes <- read.delim("mygnm_gene_clusters_summary.txt")
gnm_genes_mat <- gnm_genes %>% 
[mygnm_gene_clusters_summary.txt](https://github.com/merenlab/anvio/files/12039608/mygnm_gene_clusters_summary.txt)

  mutate(cnt = 1) %>% 
  pivot_wider(id_cols = genome_name, names_from = gene_cluster_id, 
              values_from = cnt, values_fn = sum, values_fill = 0) %>% 
  column_to_rownames("genome_name")

The tree from anvio looks like this: image

however the closest that I can get to it is by using euclidean distance and ward.D2 agglomeration computed on the frequency matrix (genomes clustering is nearly identical, only Sp18 is in a different position)

plot(hclust(vegan::vegdist(gnm_genes_mat, method = "eucl"), method = "average")) image

while if I use the presence/absence matrix, the tree looks quite different

gnm_genes_mat_pa <- gnm_genes_mat
gnm_genes_mat_pa[gnm_genes_mat_pa > 1] <- 1
plot(hclust(vegan::vegdist(gnm_genes_mat_pa, method = "eucl"), method = "ward.D2"))

image

Hope the question has not been already answered elsewhere but I couldn't find it. I am using Ubuntu 22.04.1 LTS and the anvio-dev version

meren commented 1 year ago

Hey @lucaz88,

The clustering of genomes in the pangenomic workflow takes place here in panops.py:

(...)
for clustering_tuple in [('gene_cluster presence absence', self.view_data),
                         ('gene_cluster frequencies', self.view_data_presence_absence)]:
    v, d = clustering_tuple
    newick = clustering.get_newick_tree_data_for_dict(d, 
                                                      transpose=True,
                                                      distance = self.distance,
                                                      linkage=self.linkage)
(...)

If you follow the function get_newick_tree_data_for_dict in panops.py you will find that the default clustering uses 'euclidean' distance with 'ward' linkage on l1-normalized vectors by default, where linkage is calculated like this:

linkage = hierarchy.linkage(vectors, metric=distance, method=linkage)

Your first dendrogram is almost identical to the one in anvi'o (if you rotate things around a bit), and differences are likely due to the normalization step. I hope this helps.

Best wishes, Meren.

lucaz88 commented 1 year ago

Thanks @meren !! I missed the normalization step! Now it looks identical without the need to do rotate anything 😉

meren commented 1 year ago

Perfect :) Thank you for reporting back!