pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
355 stars 38 forks source link

Building a genus pangenome case #226

Open sivico26 opened 2 years ago

sivico26 commented 2 years ago

Hello there,

I want to build a pangenome for a genus of grasses (Hordeum) I am working on in order to do some analysis. I have been playing with pggb for a while now but I think is time to request some help.

The system

I have 7 genomes in total, which I classify in low (n = 3), medium (n = 2) and high (n = 2, but 3 haplotypes) divergence. All the species have 7 chromosomes and we expect high chromosomal macrosynteny (no chromosome fission/fissions in this system). This is an assumption though. The genome sizes vary slightly, but they are about 4Gb.

Following the divergence estimation tutorial, this is what I got.

        |------Low divergence genomes(3%)
|-------|6.5%
|       |------ Medium divergence genomes (5%)
|
|12.5%
|
|-----------High divergence genomes (8%)

So the number in the leaves is the divergence intra-category (e.g. the genomes in the low divergence group, present an average divergence of 3%), while the numbers on the nodes are inter-category divergence (e.g. any of the genomes of the high divergence group, compared to any of the genomes of low divergence, present an average divergence of 12.5%). I hope I made this clear.

This is consistent across chromosomes (i.e. the analysis run on each chromosome yields around the same numbers).

Building a pangenome graph

Smaller set

I started with a smaller set instead of doing everything at once, so picked the low divergence genomes for this purpose. I noticed that low values of segment-length (default of 5k, and 2k if I remember correctly) resulted in very high memory consumption. Reading the human pangenome draft preprint, I ended up with a command fairly similar to what they used:

threads="40"
outdir="out"
idt="90"  ## This could have been 95, but I did not know at the time.
nhaps="3"
match_chunk_size="100000" ## Imitating HPRC

pggb -i $input -n $nhaps -s $match_chunk_size -p $idt -M -v -t $threads -o $outdir

This built a pangenome for the 3 low divergence species in ~5 days and used 360 Gb. Here are some stats:

Put stats here
length: 7166445503
nodes: 223489144
edges: 300450324
paths: 21
num_weakly_connected_components: 1
weakly_connected_components: 
  - component:
      id: 0
      nodes: 223489144
      is_acyclic: 'no'
num_nodes_self_loops:
  total: 2
  unique: 2
A: 1911384217
C: 1673076500
G: 1671083099
N: 80300
T: 1910821387
file_size_in_bytes: 45236372775
mean_links_length:
  - length:
      path: all_paths
      in_node_space: 142.968
      in_nucleotide_space: 4661.92
      num_links_considered: 407507032
      num_gap_links_not_penalized: 0
sum_of_path_node_distances:
  - distance:
      path: all_paths
      in_node_space: 284.06
      in_nucleotide_space: 333.027
      nodes: 407507053
      nucleotides: 11322170845
      num_penalties: 145255865
      num_penalties_different_orientation: 0

I have been exploring it a bit, but honestly, I do not know what to expect (my first pangenome). After reading #187, I realized I misunderstood the effect of the segment-length parameter and 100k might be too high for my system. I am worried that the pangenome could be underaligned. So I think my first question would be how can I check for under-alignment (or proper alignment) of my pangenome?

Complete set

Then, I tried to build another pangenome with all the species. This is the command I ended up using:

outdir="out" 
idt="85"
nhaps="8" ## 7 genomes, but there is one that has 2 Haplotypes
match_chunk_size="200000" ## I now know this was a step in the wrong direction

pggb -i $input.gz -n $nhaps -s $match_chunk_size -p $idt -P asm20 -M -v -t $threads -o $outdir

I included -P asm20 because I thought it would have been better than the default asm5 (according to the divergence)

This has been running for >500h (~20 days) and it is still on the mapping step. Memory wise it has been very gentle, although I imagine it is because of the stringent mapping criteria. Also, if I understand correctly now, the result is very likely to be even more underaligned than the previous one, right? Assuming that the previous one already was to some degree.

So my second question is, how can I properly parameterize pggb to be more efficient and scale better for this system? I currently see two (maybe complementary) approaches:

The only thing I do not like about the second approach is that, as you have said elsewhere, I would miss the interchromosomal rearrangements. So my third question would be: is there a way to recover the interchromosomal rearrangements information that you would miss using this divide-by-chromosome approach? Or if you want that information you necessarily have to build an All vs All graph? I imagine that something like a graph vs graph aligner would be needed to recover the interchromosomal interactions, but I do not know if something like that exists.

Scientific interest

We are interested in this group because we have previous data that suggest there have been horizontal gene transfers (HGT) from another group of plants (Panicum) to Hordeum. We reasoned that with a pangenomics approach, such transfers would emerge in the pangenome graph as large structural variations that do not follow the clade phylogeny (How to traverse the graph looking for such SVs with phylogenetic/evolutionary considerations in mind is something I would like to ask you as well, but I will leave that for another issue since this one is already big).

At the very start, we thought that making a Hordeum pangenome and looking for the aforementioned SVs would yield some candidates that we could later compare to the supposed genomic source (Panicum) with some traditional methods. But then it struck me: if we could include Panicum (the source) in the graph, the HGTs would appear naturally in the graph as connections between the genomes. However, Panicum diverged from Hordeum a long time ago, so here we are talking about a different number of chromosomes likely with plenty of chromosome fusions and fissions.

So my fourth question is: do you think this is possible/feasible? How much divergence can you manage when building a pangenome/variational graph? I am not sure if we could still talk about a pangenome graph, but it would be very useful for us to have such variational graph. If there is a way, I suspect pggb would not be the best way of doing it, because including Panicum would force you to set the parameters very differently from what would be efficient for Hordeum alone (e.g. lowering too much the percent identity). In fact, I think that is already happening with the system I depicted above.

In this regard, after building the Hordeum pangenome with pggb, Panicum could be added using something like minigraph. The result would lack the path information from Panicum, and I imagine I would need that to transverse the graph and look for the HGTs, so that must be added as well.

I noticed that the HPRC achieved this using an adaptation of Cactus suitable for populations (Do you know if there are updates on the preprint/pipeline they were preparing for this?). Reading about the MC method, I could not stop thinking that Cactus was already suitable for handling my system (different species, clear phylogenetic relationships that could guide the alignment), so there should be a way to adapt the MC method to interspecies ~pangenome~ variational graphs, right? I guess my fifth question is what do you think about this? Do you think it is feasible? Do you think it would be hard to implement? Do you know if there is someone working on it?

Summary and questions

All right, that was a lot of information and questions, I apologize for making this issue so big, I would retrieve all the questions here for easier access:

  1. How can one check for under-alignment (or proper alignment) of a given pangenome graph?
  2. How to efficiently build an inter-species pangenome with pggb (parametrization, strategies)?
  3. If the pangenome is built with a divide-by-chromosome strategy, is it possible to recover/reintroduce the interchromosomal interactions to the pangenome? Or to get that information you necessarily have to build with an All-vs-All strategy?
  4. How divergent can you go when building a pangenomic/variational graph? Do you think this would be a good representation of the interactions between the genomes and would show, for instance, HGTs?
  5. Do you think the MC pipeline (or more precisely, their components) could be adapted to build inter-species pangenome/variational graphs?

Thank you very much for reading all this. I am quite sure with all this information there will be questions or not everything was clear so please ask. I am looking forward to discussing this.

Kind regards Sivico

sivico26 commented 1 year ago

For further reference, to answer question 5. (related to 2. as well): cactus seems to be the way to go since its Progressive-Alignment algorithm can take advantage of the phylogenetic relationships to make the building process of linear complexity (instead of quadratic). The .hal output can then be converted to a pangenome graph.