vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

VG with very large chromosomes #3328

Open agolicz opened 3 years ago

agolicz commented 3 years ago

Hi, I apologize in advance for the lack of code examples, but I have to estimate computational resources before starting the project and was hoping to get some guidance.

We are working on a complex plant genome (13 Gb split across 6 chromosomes of similar size). We will have 8 genome assemblies and short-read data for 10 other accessions. I am planning to make per-chromosome alignments using Cactus, convert those to graphs, and then merge the per-chromosome graphs for downstream analysis (possibly using the Cactus pangenome pipeline: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md). The downstream analyses will include variant genotyping from short reads (possibly with augmentation prior to genotyping, but I have no idea if that will work) and Iso-Seq mapping.

We have a node with 500 GB RAM. Do you think that would be a feasible plan? Will I have enough memory for downstream analysis with a graph potentially containing ~20-25 Gb of sequence and chromosomes of about 2-3 Gb?

All the best, Agnieszka
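For planning purposes, the figures above admit a quick back-of-envelope calculation (a sketch in Python; all numbers are the rough estimates from this comment, not measured values):

```python
# Back-of-envelope sizing for the proposed per-chromosome graph builds.
# All figures are rough assumptions taken from the description above.
genome_gb = 13        # total genome size, Gb
n_chromosomes = 6     # similar-sized chromosomes
n_assemblies = 8      # genome assemblies to align

chrom_gb = genome_gb / n_chromosomes            # ~2.2 Gb per chromosome
seq_per_job_gb = chrom_gb * n_assemblies        # sequence entering one per-chromosome job

print(f"~{chrom_gb:.1f} Gb per chromosome")
print(f"~{seq_per_job_gb:.0f} Gb of input sequence per per-chromosome job")
```

So each per-chromosome alignment would see roughly 17 Gb of raw sequence, before any graph compression from shared sequence.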

ekg commented 3 years ago

I'm curious if we could run pggb on this. The memory should be fine (we do 100X human chr1 in ~128G of RAM). It would need to be broken up by chromosome.

Would love to try out a build. Is the data public?
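Breaking the input up by chromosome, as suggested above, could be sketched with a small script like this (an illustration only; it assumes PanSN-style record names such as `asm1#1#chr1`, which is a common convention for pggb inputs but an assumption here):

```python
# Sketch: partition a multi-FASTA into one file per chromosome so that
# pggb can be run on each chromosome separately. Record names are
# assumed to be PanSN-style, e.g. "asm1#1#chr1" (sample#haplotype#contig).
from pathlib import Path

def chrom_of(name):
    # Take the last '#'-separated field as the chromosome name.
    return name.split("#")[-1]

def split_by_chromosome(fasta_path, outdir):
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    handles = {}
    out = None
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                chrom = chrom_of(line[1:].split()[0])
                if chrom not in handles:
                    handles[chrom] = open(outdir / f"{chrom}.fa", "w")
                out = handles[chrom]
            out.write(line)
    for h in handles.values():
        h.close()
```

Each resulting file (e.g. `chr1.fa`, holding chr1 from every assembly) would then get its own `pggb -i chr1.fa ...` run.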

agolicz commented 3 years ago

Oh pggb looks very nice, I did not know about it! Unfortunately, I don't have the data yet. But I would be happy to share when it arrives (although it most likely won't be public for a while). I can discuss with the team. For vg map/giraffe/mpmap do you think I could use the whole graph or will I have to split by chromosome? I assume splitting by chromosome would affect the results...

jltsiren commented 3 years ago

Index construction is the most memory-intensive part of read mapping. You can probably build whole-genome indexes for Giraffe in 500 GB. vg map and mpmap are more difficult, because the memory usage of GCSA construction depends on the size of individual chromosomes.

With a graph of that size, you may be getting close to various size limits. There are some memory optimizations that limit the number of nodes to ~2^31 and the length of individual contigs to ~2^32.
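The limits mentioned above can be checked against the expected inputs in a couple of lines (the concrete sizes are the rough estimates from this thread, and the limit constants are the approximate figures quoted above):

```python
# Approximate size limits mentioned above (memory-optimized vg structures).
MAX_NODES = 2 ** 31        # ~2.1 billion nodes
MAX_CONTIG_BP = 2 ** 32    # ~4.3 Gbp per contig

# Upper end of the 2-3 Gb chromosome estimate from this thread.
chrom_bp = 3_000_000_000
assert chrom_bp < MAX_CONTIG_BP   # individual chromosomes fit the contig limit

print(f"node limit ~{MAX_NODES:,}, contig limit ~{MAX_CONTIG_BP:,} bp")
```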

agolicz commented 3 years ago

Thanks, I was afraid that might be the case. If I hit the chromosome size limits, I suppose I could split up the chromosomes at the same location across assemblies. To reduce the number of nodes, one option would be to build a graph with minigraph (which should only capture larger variants) and try to use that. There is no official tool for converting minigraph rGFA to vg-compatible GFA, right? Plants can be very mean with chromosome sizes... Thanks again, this was very helpful. If you have any other suggestions I would be very grateful.

All the best, Agnieszka
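The idea of cutting each assembly's copy of a chromosome at the same coordinate could look something like this (a sketch with hypothetical names and coordinates; in practice the cut point would have to fall in a collinear, well-aligned region of all assemblies):

```python
def split_at(seq, breakpoints):
    """Cut one chromosome sequence at fixed coordinates, returning the pieces."""
    pieces, start = [], 0
    for bp in breakpoints:
        pieces.append(seq[start:bp])
        start = bp
    pieces.append(seq[start:])
    return pieces

# Hypothetical example: the same breakpoint applied to each assembly's chr1.
assemblies = {"asm1_chr1": "ACGTACGTAC", "asm2_chr1": "ACGTTCGTAA"}
halves = {name: split_at(seq, [5]) for name, seq in assemblies.items()}
```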

jltsiren commented 3 years ago

The number of nodes largely depends on rare variants. For example, a whole-genome human graph based on ~100 aligned haplotypes typically has ~100 million nodes. A graph based on 1000GP VCFs has ~300 million nodes, and a graph based on TOPMed VCFs would probably have >2 billion nodes. If your graph is broadly similar to the first one, the 20-25 Gb estimate should guarantee that you stay within the node limit.

The contig length limit is technically an upper bound for shortest distances inside graph components. If graph topology is reasonable, it should be possible to handle up to ~4 Gb chromosomes. In some weirder graphs, components may be much longer than the chromosomes they are based on.
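The node-count figures above suggest a rough scaling check (the per-Gb rate is extrapolated from the human example and is only an order-of-magnitude assumption):

```python
# Rough extrapolation from the human figures above: ~100 million nodes
# for a ~3 Gbp genome with ~100 aligned haplotypes.
human_nodes = 100_000_000
human_gb = 3
nodes_per_gb = human_nodes / human_gb    # ~33 million nodes per Gb

graph_gb = 25                            # upper end of the 20-25 Gb estimate
est_nodes = nodes_per_gb * graph_gb      # ~833 million nodes

assert est_nodes < 2 ** 31               # comfortably under the node limit
print(f"estimated ~{est_nodes / 1e6:.0f} million nodes")
```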

agolicz commented 3 years ago

Many thanks again. I am sure I will be back when the data arrives!

All the best, Agnieszka

agolicz commented 2 years ago

Hello, as promised I return :), now that we have some data. We have assemblies for two genomes, and I tried to build a graph with pggb for one of the chromosomes (about 1.8 Gb in length, mostly repeats). I have a 1 TB memory VM, but it looks like I ran out of memory after about 24 hours.

```
pggb -i input.fa -o output -t 40 -K 52 -p 95 -s 300000 -k 511 -n 2 -G 23117,23219 -v
[wfmash::map] Reference = [input.fa]
[wfmash::map] Query = [input.fa]
[wfmash::map] Kmer size = 52
[wfmash::map] Window size = 256
[wfmash::map] Segment length = 300000 (read split allowed)
[wfmash::map] Block length min = 1500000
[wfmash::map] Chaining gap max = 30000000
[wfmash::map] Percentage identity threshold = 95%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 40
[wfmash::skch::Sketch::build] minimizers picked from reference = 28551821
[wfmash::skch::Sketch::index] unique minimizers = 9646834
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 4333751) ... (14167, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.5%, ignore minimizers occurring >= 36 times during lookup.
[wfmash::map] time spent computing the reference index: 127.962 sec
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 2.02e+07 bp/s elapsed: 00:00:03:01 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 1, reads qualified for mapping = 2, total input reads = 2, total input bp = 3658449192
[wfmash::map] time spent mapping the query: 1.81e+02 sec
[wfmash::map] mapping results saved in: /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::align] Reference = [input.fa]
[wfmash::align] Query = [input.fa]
[wfmash::align] Mapping file = /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::align] Alignment identity cutoff = 7.60e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent loading the reference index: 7.86e-03 sec
[wfmash::align::computeAlignments] aligned 57.22% @ 4.45e+04 bp/s elapsed: 00:05:42:33 remain: 00:04:16:04Command terminated by signal 9
wfmash -X -s 300000 -k 52 -p 95 -n 2 -t 40 input.fa input.fa
4545239.25s user 21913.28s system 2375% cpu 192223.15s total 1050026284Kb max memory
```
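The reported peak memory is consistent with the 1 TB VM being exhausted (signal 9 is SIGKILL, which is what the kernel OOM killer sends); converting the logged figure:

```python
# Peak memory reported by the time wrapper above, in KB.
max_kb = 1_050_026_284
max_gb = max_kb / 1024 ** 2   # KB -> GB (binary units)
print(f"{max_gb:.0f} GB")     # ~1001 GB, i.e. essentially the whole 1 TB VM
```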

@ekg do you perhaps have any suggestions to improve the parameter choice? The assemblies are not yet public (my fault, I keep playing around with scaffolding for one of them), but I would be happy to share what we have for testing.

Agnieszka

ekg commented 2 years ago

Hi!

I would first suggest using a new version of pggb (the current release is just a few weeks old, so update if you haven't lately). Running out of memory in wfmash is surprising, but it also seems you are pushing limits (in terms of contig size) that we haven't tested. Would you please cross-post this to the pggb issues?

-Erik
