vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Modifying and joining graphs #2899

Open brettChapman opened 4 years ago

brettChapman commented 4 years ago

Hi

I've been running Cactus on 3 different Barley genomes (planning to scale up to more genomes later). I've been running each job to create alignments of each chromosome from 1 to 7. I'm breaking the jobs per chromosome because I don't have enough memory to perform a whole genome alignment across multiple genomes. I'm running each node of my cluster with only 16 cores and 64GB RAM, and splitting each chromosome onto a different node.

As a result, I will completely miss structural rearrangements across different chromosomes, only identifying indels and snps.

I have seen there are join, concact and mod functions of VG and I was wondering if it is possible to join all the graphs for each chromosome afterwards, and then modify the whole graph structure to include large structural rearrangements, identified by other means, such as pairwise alignments to a reference genome and variant calls, or possibly simulating reads from the whole complete genome graph using the sim function and mapping them back to identify any large structural variants missed previously. I see that there is an augment function, but it would add new paths. I would like to modify current paths in the graphs to include larger structural variants which were missed. Is that possible?

I also plan to include many more genomes in my alignment (~20), but will likely have to break them down into subgroups to perform the alignments using the resources I have. Is it possible to merge all these graphs from multiple genome alignments. My understanding is that it would be possible, as long as the root genome is present in all alignments. I could identify the root genome from the guide tree which I would generate first from a phylogenetic analysis. I have been using Mashtree to generate the newick files for this purpose. If no root genome is provided, Cactus infers one and calls it Anc0. But I could also identify one myself by running Mashtree on all 20 genomes.

Cactus also has a cactus-prepare function to break down the alignments into sub-problems, but even when running with 3 genomes, it still wants to align all 3 genomes at once, pushing the resource requirements above what I have. Therefore I have no choice but to split my problem up by chromosome, and merge graphs at the end.

Thanks for any help you can provide.

ekg commented 4 years ago

I would suggest exploring seqwish in parallel. It should be no problem to combine all the whole genomes in one step on a single node in <64G of RAM (it's disk-backed).

I've been then using tools in odgi (also in vgteam) to process and interrogate the graph afterwards. It's possible to use these graphs in vg just as with Cactus.

On Fri, Jul 10, 2020, 04:46 Brett Chapman notifications@github.com wrote:

Hi

I've been running Cactus on 3 different Barley genomes (planning to scale up to more genomes later). I've been running each job to create alignments of each chromosome from 1 to 7. I'm breaking the jobs per chromosome because I don't have enough memory to perform a whole genome alignment across multiple genomes. I'm running each node of my cluster with only 16 cores and 64GB RAM, and splitting each chromosome onto a different node.

As a result, I will completely miss structural rearrangements across different chromosomes, only identifying indels and snps.

I have seen there are join, concact and mod functions of VG and I was wondering if it is possible to join all the graphs for each chromosome afterwards, and then modify the whole graph structure to include large structural rearrangements, identified by other means, such as pairwise alignments to a reference genome and variant calls, or possibly simulating reads from the whole complete genome graph using the sim function and mapping them back to identify any large structural variants missed previously. I see that there is an augment function, but it would add new paths. I would like to modify current paths in the graphs to include larger structural variants which were missed. Is that possible?

I also plan to include many more genomes in my alignment (~20), but will likely have to break them down into subgroups to perform the alignments using the resources I have. Is it possible to merge all these graphs from multiple genome alignments. My understanding is that it would be possible, as long as the root genome is present in all alignments. I could identify the root genome from the guide tree which I would generate first from a phylogenetic analysis. I have been using Mashtree to generate the newick files for this purpose. If no root genome is provided, Cactus infers one and calls it Anc0. But I could also identify one myself by running Mashtree on all 20 genomes.

Cactus also has a cactus-prepare function to break down the alignments into sub-problems, but even when running with 3 genomes, it still wants to align all 3 genomes at once, pushing the resource requirements above what I have. Therefore I have no choice but to split my problem up by chromosome, and merge graphs at the end.

Thanks for any help you can provide.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2899, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIID3E3FJAAZK4XGSDR2Z6JNANCNFSM4OWFOJUA .

brettChapman commented 4 years ago

Hi Erik

I'm familiar with Seqwish, as I have used it to process my HAL files outputted from Cactus, however I'm not completely familiar with all it's capabilities.

Is it non-reference based like Cactus? One reason I chose Cactus for the alignment is it's ability to prevent reference bias in variant calls. Are there any requirements for the sequences such as soft and hard masking? Cactus required me to soft mask my genome sequences beforehand,

Can it easily scale to multiple large plant genomes? My Barley genomes are around 4.7GBps each in size, they are all orientated into pseudomolecules, so complete chromosomes with gaps filled with Ns, with some ChrUn and unassigned contigs/scaffolds. I plan to align 20 genomes.

When you say Seqwish is disk-backed, does that mean I would need ample disk space instead of memory?

Thanks for your help.

glennhickey commented 4 years ago

You can indeed combine your chromosome graphs with vg ids -j then vg combine or vg index -x. The resulting whole-genome graph will probably squeeze into 64G of RAM, but that likely wouldn't be enough to do any processing on it.

Aligning contigs to this graph, then using vg augment on the alignments could in theory add variation into the graph, while replacing your paths. But I'm skeptical that it'd work well here even if you did have sufficient memory.

Both cactus and seqwish offer up support for merging alignments. For Cactus, one alignment's root should correspond to another's leaf (halAppendSubtree). For Seqwish, you just need a pairwise alignment between at least one genome from each set you want to merge.

In general, constructing pangenome graphs from genome alignments is an open problem that we (cactus/vg/seqwish developers, among others) are actively working on. And we're grappling with many of these issues, so this kind of discussion is always very welcome.

brettChapman commented 4 years ago

Thanks for your feedback.

I've decided to try and align the entire genome and not split on chromosome as I'll be missing out on a lot of important information.

I'll try aligning with Cactus, minimap2 and also GSAlign which I recently came across and is good for intra-species alignment, which is useful for my study as I'm aligning Hordeum vulgare (Barley).

The halAppendSubtree would be useful if I were to first align only subtrees and then follow root to leaf merging based on the whole tree. If I were to use Seqwish and say minimap2 for multiple pairwise alignments, how would selecting the reference genome impact the graph? Should I choose a base root genome as my reference? In my case I have the Morex genome which was the first sequenced and assembled Barley genome and has been used as a reference often in the literature, however in terms of pangenome phylogenetic trees, it is not the root. I've attached some trees to show how it looks (output from mashtree, with mash distances and bootstrap values). I've also included a smaller study which is looking at Morex vs 2 wild type Barley varieties. The tree isn't as informative, as there are only 3 genomes, and so it is impossible to break it down into smaller alignment jobs (running cactus on 3 whole genomes I ran out of memory, and I recently tried running just 2 genomes and ran out of hard drive space, using up 2T of space even before alignments start, at the blast stage, so I may need to up my space requirements).

I'm thinking for my smaller 3 genome study I'll use seqwish with minimap2/GSAlign and for my 20 genome study, I'll use Cactus, but with alignments broken down into 2 genomes (if I manage to get it to complete, still to be tested), in some cases alignments between Anc genome with one of the leaves, and in each case halAppendTree is ran to merge the HAL files. I know cactus-prepare can be run to output all the commands, so I could break it all down that way.

Based on my trees, do you think that would be a good approach for both my studies? Thanks for your help.

Morex_Damai_distance

pangenome_tree_distance

pangenome_tree_bootstrap

ekg commented 4 years ago

By minimap2 do you mean seqwish?

You can structure the alignments by a tree or just align all to all.

On Wed, Jul 22, 2020, 05:24 Brett Chapman notifications@github.com wrote:

Thanks for your feedback.

I've decided to try and align the entire genome and not split on chromosome as I'll be missing out on a lot of important information.

I'll try aligning with Cactus, minimap2 and also GSAlign which I recently came across and is good for intra-species alignment, which is useful for my study as I'm aligning Hordeum vulgare (Barley).

The halAppendSubtree would be useful if I were to first align only subtrees and then follow root to leaf merging based on the whole tree. If I were to use Seqwish and say minimap2 for multiple pairwise alignments, how would selecting the reference genome impact the graph? Should I choose a base root genome as my reference? In my case I have the Morex genome which was the first sequenced and assembled Barley genome and has been used as a reference often in the literature, however in terms of pangenome phylogenetic trees, it is not the root. I've attached some trees to show how it looks (output from mashtree, with mash distances and bootstrap values). I've also included a smaller study which is looking at Morex vs 2 wild type Barley varieties. The tree isn't as informative, as there are only 3 genomes, and so it is impossible to break it down into smaller alignment jobs (running cactus on 3 whole genomes I ran out of memory, and I recently tried running just 2 genomes and ran out of hard drive space, using up 2T of space even before alignments start, at the blast stage, so I may need to up my space requirements).

I'm thinking for my smaller 3 genome study I'll use seqwish with minimap2/GSAlign and for my 20 genome study, I'll use Cactus, but with alignments broken down into 2 genomes (if I manage to get it to complete, still to be tested), in some cases alignments between Anc genome with one of the leaves, and in each case halAppendTree is ran to merge the HAL files. I know cactus-prepare can be run to output all the commands, so I could break it all down that way.

Based on my trees, do you think that would be a good approach for both my studies? Thanks for your help.

[image: Morex_Damai_distance] https://user-images.githubusercontent.com/8529807/88129226-24f50c80-cc0a-11ea-9acc-cb983b2abcec.png

[image: pangenome_tree_distance] https://user-images.githubusercontent.com/8529807/88129124-dd6e8080-cc09-11ea-8355-30044f644073.png

[image: pangenome_tree_bootstrap] https://user-images.githubusercontent.com/8529807/88129115-d47daf00-cc09-11ea-8917-76aea8df3b9d.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2899#issuecomment-662219535, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIZWV65JIZXCENXX6TR4ZLXBANCNFSM4OWFOJUA .

brettChapman commented 4 years ago

From what I've read online seqwish doesn't actually perform the alignments, but processes the outputs of alignments. I've seen pipelines use minimap2 to create a .paf file, which seqwish then processes given the fasta file to produce a .gfa file. The seqwish parameters don't indicate any inputs for raw genomes/reads to align.

The reason I'm trying to align based on each branch and then merge the HAL files based on root and leaf, is because I can't align all to all with the memory resources I have available. I have a maximum of 64GB RAM on one node. I have 10 nodes in total, so I can run multiple alignments on each branch at the same time, and then prepare a job to merge the HAL files depending on completion of one of the alignments.

If you're referring to aligning all to all using seqwish/minimap2, my understanding of minimap2 is that it was a pairwise alignment tool, or is it capable of accepting multiple (concatenated?) genomes? Would each sequence need to be unique? At the moment I have the same sequence ID in each genome e.g. "chr1", would I need to relabel them e.g. "Morex_chr1"? Would this sequence renaming also need to be applied when processing with VG as well?