pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
355 stars 38 forks source link

Use of pggb to generate alignments for phylogenomics #319

Open sivico26 opened 1 year ago

sivico26 commented 1 year ago

Dear pggb team,

I wanted to ask your thoughts on a problem I am facing currently.

Suppose I have sequences for a given locus (of some ~100s kb) for several taxa of the same genera. The genera is young so the divergence between the species is rather low (between 1 - 2 %, according to mash). In the end, what I want is to do some phylogenetic inference, so what I need is a MSA of these sequences. Since a variational graph is equivalent to an MSA, I thought I might build a graph of the locus with pggb and from there get the MSA alignments of homologous regions somehow. Do you think this is feasible? How would you do it? Or how would you approach the problem?

I have been playing around this idea with some of my data and this is what I have discovered so far:

First I think I managed to get a reasonably linear topology with the following command:

pggb -i $input -s 10000 -p 97 -n 7 -Y "|" -S -m -M -t $threads -o $out

input fasta gz b959677 417fcdf 9eb3dc6 smooth final og lay draw

In practice, the sequences that I have are not a single contiguous contig (T2T like being the ideal case), but rather assemblies with some degree of fragmentation (and maybe overlap?). That means that the paths for a given taxon are disconnected in the graph, since pggb treats each contig as a separate path. If you look at the topology, you can see there are several dead ends that pop out the backbone of the graph (I think those are the extremes of the contigs that are kind of soft-clipping). That disconnection at the alignment level looks like this:

input fasta gz b762553 417fcdf f50f1bb smooth final og viz_multiqc

If you manage to see the prefixes in the path names on the left, you would notice (with some effort) that merging per taxa you kind of cover most of the reference (which is the last row) with some level of redundancy (there are certain regions where a given taxa has two contigs, which is problematic). However, that is something we visually infer but it is not specified anywhere. I think to make a whole alignment of the taxa in the regions we would have to somehow connect this components. This seems sort of equivalent to make some scaffolding of the assemblies based on the graph topology. If I understand correctly, the J lines in the gfa format spec. fit this purpose, but I am not sure if they can be generated or supported by downstream programs (e.g. odgi).

Anyway, even supposing you can connect those components, how do you go from there to MSA alignments suitable for phylogenetic inference?

I would really appreciate your thoughts on this issue. Let me know if you have any more inquiries. I think I left many details out.

Thanks in advance, Sivico

Isoris commented 11 months ago

I would like to get an answer as well .

subwaystation commented 11 months ago

Hi @sivico26 @Isoris,

my recommendation would be to rename the input sequences so their naming follows the https://github.com/pangenome/PanSN-spec. The delimiter of choice for PGGB is #. You can then use odgi similarity to calculate a distance matrix based on your species. Telling odgi similarity that # is your delimiter of choice, it is able to view several paths as one species. For an example and R code please take a look at https://hackmd.io/@AndreaGuarracino/SyhbiKuE2#Primate-chromosome-6.

However, if you really want some kind of MSA output you can activate a PGGB argument:

    -M, --write-maf             write MAF output representing merged POA blocks [default: off]

You will get a MAF file for each local MSA in the graph.

I hope this helps!