graph-genome / Schematize

Visualization component of Pangenome Schematics for 1,000s of individuals and gigabase genomes.
http://graphgenome.org
Apache License 2.0

Phylogenetic Tree Visualization #58

Open josiahseaman opened 4 years ago

josiahseaman commented 4 years ago

For: Kaytie Innamorati @innamoratika

Goal: We want to show related individuals and blocks of individuals for viral sequences. We should either match what nextstrain.org offers or make use of the information available there.

tpook92 commented 4 years ago

@innamoratika I used the HaploBlocker input to generate some phylogenetic trees by writing a small wrapper function for the R package ape. Statistically not 100% sound, but fine for some initial testing.

Here is what an unrooted tree for the 169 SARS2 sequences looks like: phylo_total

And here when removing the 7 most extreme outliers: phylo_162

innamoratika commented 4 years ago

@tpook92 Wonderful, thank you. I'll chat with the phylogeny folks in about 2 hours to discuss the genomes we want to include. I'm planning on using RAxML with some other programs, but will also use parts of the ape package to create a distance matrix and dendrogram.
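For readers unfamiliar with the distance matrix and dendrogram step mentioned above, here is a minimal pure-Python sketch of the general idea. It is not the ape or RAxML API, and it is not the wrapper used earlier in this thread: it assumes already-aligned sequences of equal length, uses a plain proportion-of-differences distance, and builds the tree with average-linkage (UPGMA) clustering. The function names and the four toy sequences are made up for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Map each unordered pair of names to the distance between their sequences."""
    return {frozenset((m, n)): p_distance(seqs[m], seqs[n])
            for m, n in combinations(seqs, 2)}


def upgma(seqs):
    """Average-linkage clustering; returns a Newick topology without branch lengths."""
    dist = distance_matrix(seqs)
    sizes = {name: 1 for name in seqs}  # cluster label -> number of leaves
    while len(sizes) > 1:
        a, b = sorted(min(dist, key=dist.get))  # closest pair of clusters
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average of the distances from the two merged clusters.
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * sizes[a]
                + dist[frozenset((b, other))] * sizes[b]
            ) / new_size
        # Drop every entry that still refers to one of the merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Toy aligned sequences (hypothetical, not real SARS-CoV-2 data).
toy = {
    "A": "AAAAAAAA",
    "B": "AAAAAAAT",
    "C": "GGGGAAAA",
    "D": "GGGGAATT",
}
newick = upgma(toy)
print(newick)
```

The resulting Newick string can be loaded into most tree viewers. A real analysis would use an evolutionary substitution model and a proper tree-building method rather than raw mismatch counts.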

josiahseaman commented 4 years ago

We'll need to change the specification for the v13 JSON format: https://github.com/graph-genome/component_segmentation/issues/16 I think it's likely we'll only want one tree for the entire genome. We had discussed having a dendrogram for each gene or section of the genome. HaploBlocker makes a new row ordering per breakpoint (a breakpoint being a recombination region, which the virus doesn't have). @innamoratika do you see any reason that we might want more than one dendrogram for the whole pangenome? Would you ever want to do it per gene? Would that be useful to researchers?

subwaystation commented 4 years ago

I think https://github.com/neherlab/pan-genome-visualization already has SNP and gene trees. I know Richard Neher is an established expert when it comes to viruses, so I expect doing it per gene makes sense. We might be able to learn something from that tool, too.

tpook92 commented 4 years ago

I would assume that at the single-gene level most SARS2 variants are just 100% the same. To get any differentiation between the sequences, I would assume that using as much information as possible (usually the whole genome) should be the way to go.

When working on a more diverse set (e.g. including SARS1 / application in other species) single gene trees should be useful.
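The point about single genes versus the whole genome can be illustrated with a small sketch: count how many distinct variants remain when only one window of the alignment is considered. The sequences and the gene coordinates below are hypothetical toy values, not taken from the 169-sequence data set, and `distinct_variants` is a name made up for this example.

```python
def distinct_variants(seqs, start=0, end=None):
    """Count distinct sequences within the half-open alignment window [start, end)."""
    return len({s[start:end] for s in seqs.values()})


# Toy aligned sequences (hypothetical): all differences lie outside the "gene".
toy = {
    "A": "AAAAAAAA",
    "B": "AAAAAAAT",
    "C": "AAAAAATT",
}
gene_start, gene_end = 0, 4  # hypothetical gene coordinates

in_gene = distinct_variants(toy, gene_start, gene_end)
whole_genome = distinct_variants(toy)
print(in_gene, whole_genome)
```

When the per-gene count collapses to 1, a tree built from that gene alone has nothing to separate the samples, whereas the whole-genome window still distinguishes them.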