Open sivico26 opened 2 years ago
For further reference, to answer question 5. (related to 2. as well): cactus
seems to be the way to go since its Progressive-Alignment algorithm can take advantage of the phylogenetic relationships to make the building process of linear complexity (instead of quadratic). The .hal
output can then be converted to a pangenome graph.
Hello there,
I want to build a pangenome for a genus of grasses (Hordeum) I am working on in order to do some analysis. I have been playing with
pggb
for a while now but I think is time to request some help.The system
I have 7 genomes in total, which I classify in low (n = 3), medium (n = 2) and high (n = 2, but 3 haplotypes) divergence. All the species have 7 chromosomes and we expect high chromosomal macrosynteny (no chromosome fission/fissions in this system). This is an assumption though. The genome sizes vary slightly, but they are about 4Gb.
Following the divergence estimation tutorial, this is what I got.
So the number in the leaves is the divergence intra-category (e.g. the genomes in the low divergence group, present an average divergence of 3%), while the numbers on the nodes are inter-category divergence (e.g. any of the genomes of the high divergence group, compared to any of the genomes of low divergence, present an average divergence of 12.5%). I hope I made this clear.
This is consistent across chromosomes (i.e. the analysis run on each chromosome yields around the same numbers).
Building a pangenome graph
Smaller set
I started with a smaller set instead of doing everything at once, so picked the low divergence genomes for this purpose. I noticed that low values of
segment-length
(default of 5k, and 2k if I remember correctly) resulted in very high memory consumption. Reading the human pangenome draft preprint, I ended up with a command fairly similar to what they used:This built a pangenome for the 3 low divergence species in ~5 days and used 360 Gb. Here are some stats:
Put stats here
I have been exploring it a bit, but honestly, I do not know what to expect (my first pangenome). After reading #187, I realized I misunderstood the effect of the
segment-length
parameter and 100k might be too high for my system. I am worried that the pangenome could be underaligned. So I think my first question would be how can I check for under-alignment (or proper alignment) of my pangenome?Complete set
Then, I tried to build another pangenome with all the species. This is the command I ended up using:
I included
-P asm20
because I thought it would have been better than the default asm5 (according to the divergence)This has been running for >500h (~20 days) and it is still on the mapping step. Memory wise it has been very gentle, although I imagine it is because of the stringent mapping criteria. Also, if I understand correctly now, the result is very likely to be even more underaligned than the previous one, right? Assuming that the previous one already was to some degree.
So my second question is, how can I properly parameterize
pggb
to be more efficient and scale better for this system? I currently see two (maybe complementary) approaches:The only thing I do not like about the second approach is that, as you have said elsewhere, I would miss the interchromosomal rearrangements. So my third question would be: is there a way to recover the interchromosomal rearrangements information that you would miss using this divide-by-chromosome approach? Or if you want that information you necessarily have to build an
All vs All
graph? I imagine that something like a graph vs graph aligner would be needed to recover the interchromosomal interactions, but I do not know if something like that exists.Scientific interest
We are interested in this group because we have previous data that suggest there have been horizontal gene transfers (HGT) from another group of plants (Panicum) to Hordeum. We reasoned that with a pangenomics approach, such transfers would emerge in the pangenome graph as large structural variations that do not follow the clade phylogeny (How to traverse the graph looking for such SVs with phylogenetic/evolutionary considerations in mind is something I would like to ask you as well, but I will leave that for another issue since this one is already big).
At the very start, we thought that making a Hordeum pangenome and looking for the aforementioned SVs would yield some candidates that we could later compare to the supposed genomic source (Panicum) with some traditional methods. But then it struck me: if we could include Panicum (the source) in the graph, the HGTs would appear naturally in the graph as connections between the genomes. However, Panicum diverged from Hordeum a long time ago, so here we are talking about a different number of chromosomes likely with plenty of chromosome fusions and fissions.
So my fourth question is: do you think this is possible/feasible? How much divergence can you manage when building a pangenome/variational graph? I am not sure if we could still talk about a pangenome graph, but it would be very useful for us to have such variational graph. If there is a way, I suspect
pggb
would not be the best way of doing it, because including Panicum would force you to set the parameters very differently from what would be efficient for Hordeum alone (e.g. lowering too much the percent identity). In fact, I think that is already happening with the system I depicted above.In this regard, after building the Hordeum pangenome with
pggb
, Panicum could be added using something likeminigraph
. The result would lack the path information from Panicum, and I imagine I would need that to transverse the graph and look for the HGTs, so that must be added as well.I noticed that the HPRC achieved this using an adaptation of
Cactus
suitable for populations (Do you know if there are updates on the preprint/pipeline they were preparing for this?). Reading about the MC method, I could not stop thinking thatCactus
was already suitable for handling my system (different species, clear phylogenetic relationships that could guide the alignment), so there should be a way to adapt the MC method to interspecies ~pangenome~ variational graphs, right? I guess my fifth question is what do you think about this? Do you think it is feasible? Do you think it would be hard to implement? Do you know if there is someone working on it?Summary and questions
All right, that was a lot of information and questions, I apologize for making this issue so big, I would retrieve all the questions here for easier access:
pggb
(parametrization, strategies)?Thank you very much for reading all this. I am quite sure with all this information there will be questions or not everything was clear so please ask. I am looking forward to discussing this.
Kind regards Sivico