pangenome / smoothxg

linearize and simplify variation graphs using blocked partial order alignment

Multiple consensus paths #37

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Hi Erik

I've followed your PGGB updates and am now using smoothxg with the added consensus paths. I've loaded the graph into vg and sequenceTubeMap. I notice I have 8 different consensus paths in my graph, and when I run vg deconstruct I get variants called on each of these consensus paths. Do I have multiple consensus paths because of breaks and jumps in the graph, depending on -C (which I have set to 10,100,1000,10000, as in PGGB)? Would lowering the -C parameter reduce the number of consensus paths generated?

Would you mind explaining the benefit of having the consensus paths in the graph? Is it basically the graph collapsed down to represent all common regions across the pangenome? How could I use these paths to investigate the pangenome? From what I can see in the alignments, the consensus represents the most common paths (core sequences). Would this be an accurate description of the consensus paths? Thanks.

ekg commented 3 years ago

Hi Brett,

The idea with the consensus graphs is to build lower-resolution versions of the pangenome that are still very "close" in terms of sequence content to the genomes in the graph. These low-resolution versions of the graph can help us inspect the graph in interactive systems, or compare it to other graphs. They are faster to work with than the full graph, which has advantages in many settings.

Right now, the consensus sequences are a kind of reference set of coordinates that cover the graph. The idea is that we can go from a low resolution graph to find the corresponding region of the full graph or MAF. We look up the consensus paths in the given region of the consensus graph at a given C threshold. We'd then subset the base graph to these consensus paths, or search in the MAF, etc.
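The "subset the base graph to these consensus paths" step can be sketched on a GFA v1 file. This is only a minimal illustration, not smoothxg's or odgi's implementation: the `Consensus_` name prefix, the toy graph, and both function names are assumptions made for the example, so check the path names in your own graph first.

```python
# Toy GFA v1 graph (hypothetical): two sample paths and one consensus path.
EXAMPLE_GFA = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT", "S\t2\tG", "S\t3\tT", "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M", "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M", "L\t3\t+\t4\t+\t0M",
    "P\tsampleA\t1+,2+,4+\t*",
    "P\tsampleB\t1+,3+,4+\t*",
    "P\tConsensus_0\t1+,2+,4+\t*",
]


def consensus_segments(gfa_lines, prefix="Consensus_"):
    """Map each path whose name starts with `prefix` (an assumed naming
    convention) to the list of segment IDs it visits."""
    paths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "P" and fields[1].startswith(prefix):
            # Each step is a segment ID followed by an orientation (+ or -).
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return paths


def subset_to_segments(gfa_lines, keep):
    """Keep the header, segments in `keep`, links with both ends in `keep`,
    and paths that only visit segments in `keep`."""
    out = []
    for line in gfa_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        tag = fields[0]
        if tag == "H":
            out.append(line)
        elif tag == "S" and fields[1] in keep:
            out.append(line)
        elif tag == "L" and fields[1] in keep and fields[3] in keep:
            out.append(line)
        elif tag == "P" and all(s[:-1] in keep for s in fields[2].split(",")):
            out.append(line)
    return out


paths = consensus_segments(EXAMPLE_GFA)
keep = {seg for segs in paths.values() for seg in segs}
for line in subset_to_segments(EXAMPLE_GFA, keep):
    print(line)
```

In the toy graph, sampleB is dropped because it passes through segment 3, which the consensus path does not visit. On a real graph you would more likely do the subsetting with a graph toolkit than with hand-rolled parsing; the sketch only shows the idea of using consensus paths as a coordinate set for pulling out a region.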

There are some quirks. The consensus path set contains both the heaviest-bundle POA consensus paths from the block MSAs represented in the MAF file and "Link" paths. A link path either walks from the end of one consensus path to the beginning of another, or covers sequence variation (approximately) greater than the given C threshold that would otherwise be fully contained in a given consensus. This latter category contains large SVs of all types. These alleles are aggregated progressively by working through the set of potential links in order of frequency and divergence from the reference.
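As a rough illustration of how the C threshold changes what gets kept, here is a toy sketch of filtering and ordering candidate links. It is not taken from the smoothxg code: the `LinkCandidate` fields, the strict `>` comparison, and the sort direction (most frequent first, then most divergent) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkCandidate:
    """Hypothetical record for one potential link path."""
    name: str
    frequency: int      # assumed: number of genomes supporting this link
    divergence_bp: int  # assumed: variation relative to the consensus, in bp


def select_links(candidates, c_threshold):
    """Keep candidates whose divergence exceeds the threshold, then order
    them by frequency and divergence (both descending, by assumption)."""
    kept = [c for c in candidates if c.divergence_bp > c_threshold]
    return sorted(kept, key=lambda c: (-c.frequency, -c.divergence_bp))


candidates = [
    LinkCandidate("a", frequency=5, divergence_bp=50),
    LinkCandidate("b", frequency=3, divergence_bp=1200),
    LinkCandidate("c", frequency=5, divergence_bp=300),
    LinkCandidate("d", frequency=1, divergence_bp=20000),
]

# One pass per threshold, mirroring -C 10,100,1000,10000.
for c_threshold in (10, 100, 1000, 10000):
    names = [c.name for c in select_links(candidates, c_threshold)]
    print(c_threshold, names)
```

In this toy model a lower threshold keeps more link candidates, so it yields more paths rather than fewer. Whether the same holds for the count of consensus paths in a real graph depends on details not covered here.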

The exact nomenclature, naming, and organization of this isn't fully implemented, and will probably evolve. The link paths aren't yet embedded in the graph, but they should be.

brettChapman commented 3 years ago

Thanks for the explanation. I'll keep an eye on how the use of these consensus paths evolves over time.