`vg msga` quite slow with 20 1Mb genomes

bredelings commented 4 years ago

1. What were you trying to do?

I have been using FSA to align about 20 de novo assembled genomes, each about 1 Mb long. They are from different individuals of the same species. FSA runs, but it takes tons of memory, and the resulting alignments have a lot of problems, especially near repeat regions.

An algorithm that used POA (or even better, a tree that was allowed to change across the sequence) would probably give a lot better results.

I just tried vg msga. My goal is to dump the *.vg file back to FASTA after the graph is generated. However, it seems to slow down rapidly as the number of sequences grows. The first few sequences took a few hours, but it has been stuck on the 11th sequence for about two days.

2. What did you want to happen?

I expected to be able to align 20 sequences. I guess I expected roughly linear behavior in the number of genomes.

3. What actually happened?

The 11th genome has taken about two days.

Here is what the output for the last successfully-aligned genome looked like:

LCPArray::LCPArray(): Construction: 0.100241 seconds, 4.43544 GB
LCPArray::LCPArray(): 3722994 values at 5 levels (branching factor 64)
[vg msga] : min_mem_length = 16, mem_reseed_length = 24, min_cluster_length = 0
ERR2679005-2x-LT635612: adding to graph 10/21
ERR2679005-2x-LT635612: aligning 1010808bp -> g:1696377bp n:160132 e:205746
ERR2679005-2x-LT635612: editing graph
ERR2679005-2x-LT635612: sorting and compacting ids
building xg index
building GCSA2 index
InputGraph::InputGraph(): 5922208 kmers in 1 file(s)
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
InputGraph::readKeys(): 3756478 unique keys
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
InputGraph::readFrom(): 3392933 unique start nodes
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
PathGraph::PathGraph(): 5922208 paths with 11844416 ranks
PathGraph::PathGraph(): 0.176496 GB in 1 file(s)
GCSA::GCSA(): Preprocessing: 1.1579 seconds, 4.94569 GB
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
PathGraph::prune(): 5922208 -> 3927014 paths (3394850 ranges)
PathGraph::prune(): 3243730 unique, 0 redundant, 359884 unsorted, 323400 nondeterministic paths
PathGraph::prune(): 0.117034 GB in 1 file(s)
PathGraph::read(): File 0: Read 3927014 order-16 paths
PathGraph::extend(): File 0: Created 4060742 order-32 paths
PathGraph::read(): File 0: Read 4060742 order-32 paths
PathGraphBuilder::sort(): File 0: Sorted 4060742 paths
PathGraph::extend(): 3927014 -> 4060742 paths (8615096 ranks)
PathGraph::extend(): 0.122858 GB in 1 file(s)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
PathGraph::prune(): 4060742 -> 4004742 paths (3584748 ranges)
PathGraph::prune(): 3491525 unique, 0 redundant, 130696 unsorted, 382521 nondeterministic paths
PathGraph::prune(): 0.120981 GB in 1 file(s)
PathGraph::read(): File 0: Read 4004742 order-32 paths
PathGraph::extend(): File 0: Created 4151065 order-64 paths
PathGraph::read(): File 0: Read 4151065 order-64 paths
PathGraphBuilder::sort(): File 0: Sorted 4151065 paths
PathGraph::extend(): 4004742 -> 4151065 paths (9367389 ranks)
PathGraph::extend(): 0.12768 GB in 1 file(s)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
PathGraph::prune(): 4151065 -> 4138989 paths (3666349 ranges)
PathGraph::prune(): 3556087 unique, 0 redundant, 135245 unsorted, 447657 nondeterministic paths
PathGraph::prune(): 0.1272 GB in 1 file(s)
PathGraph::read(): File 0: Read 4138989 order-64 paths
PathGraph::extend(): File 0: Created 9714161 order-128 paths
PathGraph::read(): File 0: Read 9714161 order-128 paths
PathGraphBuilder::sort(): File 0: Sorted 9714161 paths
PathGraph::extend(): 4138989 -> 9714161 paths (59219924 ranks)
PathGraph::extend(): 0.43774 GB in 1 file(s)
GCSA::GCSA(): Prefix-doubling: 16.9 seconds, 4.94569 GB
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): 3664810 paths with 8107893 ranks and 378171 additional start nodes
MergedGraph::MergedGraph(): 0.121167 GB
GCSA::GCSA(): Merging: 2.22692 seconds, 4.94569 GB
GCSA::GCSA(): Building the index
GCSA::GCSA(): Construction: 2.95495 seconds, 4.94569 GB
GCSA::GCSA(): 3664810 paths, 3753295 edges
GCSA::GCSA(): 4042981 pointers (650048 redundant)
GCSA::GCSA(): 633681 samples at 303585 positions
LCPArray::LCPArray(): Construction: 0.0936513 seconds, 4.94569 GB
LCPArray::LCPArray(): 3722983 values at 5 levels (branching factor 64)
[vg msga] : min_mem_length = 16, mem_reseed_length = 24, min_cluster_length = 0
ERR2679008-2x-LT635612: adding to graph 11/21
ERR2679008-2x-LT635612: aligning 1010546bp -> g:1696458bp n:160273 e:210530
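
To compare graph size from one step to the next, the `aligning` lines in the log above can be pulled out with a short script. This is only a sketch: it assumes that `g:`, `n:` and `e:` are the graph length in bp, the node count and the edge count, which is inferred from the line layout rather than confirmed by vg documentation. `parse_align_lines` is a hypothetical helper name, and the sample text is copied from the log.

```python
import re

# Matches vg msga progress lines such as
#   "ERR2679008-2x-LT635612: aligning 1010546bp -> g:1696458bp n:160273 e:210530"
# Assumption: g = graph length (bp), n = node count, e = edge count.
ALIGN_RE = re.compile(
    r"^(?P<name>\S+): aligning (?P<seq_bp>\d+)bp -> "
    r"g:(?P<graph_bp>\d+)bp n:(?P<nodes>\d+) e:(?P<edges>\d+)"
)


def parse_align_lines(log_text):
    """Return one dict per 'aligning' line, in log order."""
    rows = []
    for line in log_text.splitlines():
        m = ALIGN_RE.match(line.strip())
        if m:
            row = {"name": m.group("name")}
            for key in ("seq_bp", "graph_bp", "nodes", "edges"):
                row[key] = int(m.group(key))
            rows.append(row)
    return rows


# Lines copied from the log above (steps 10/21 and 11/21).
LOG = """\
ERR2679005-2x-LT635612: adding to graph 10/21
ERR2679005-2x-LT635612: aligning 1010808bp -> g:1696377bp n:160132 e:205746
ERR2679005-2x-LT635612: editing graph
ERR2679008-2x-LT635612: adding to graph 11/21
ERR2679008-2x-LT635612: aligning 1010546bp -> g:1696458bp n:160273 e:210530
"""

rows = parse_align_lines(LOG)

# Report how much the target graph changed between consecutive steps.
for prev, cur in zip(rows, rows[1:]):
    print(
        cur["name"],
        "graph_bp +%d" % (cur["graph_bp"] - prev["graph_bp"]),
        "nodes +%d" % (cur["nodes"] - prev["nodes"]),
        "edges +%d" % (cur["edges"] - prev["edges"]),
    )
```

Under that reading, the graph that the 11th genome is aligned against is 81 bp longer than the one the 10th genome saw, with 141 more nodes and 4784 more edges.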

5. What data and command can the vg dev team use to make the problem happen?

vg msga -f 20-genomes.fasta -b PVP01-LT635612 -B 128 -D -t 6 > test.vg

6. What does running vg version say?

vg version v1.27.1 "Deliceto"
Compiled with g++ (Debian 10.2.0-13) 10.2.0 on Linux
Linked against libstd++ 20200930
Built by buildd@x86-csail-01

ekg commented 4 years ago

Please look at the pggb pipeline.

github.com/pangenome/pggb

It is designed to construct graphs from whole genomes. Building a graph from your 20 1 Mb genomes should take a minute or so at most.

vg msga is useful for very small cases, no more than a few hundred kbp, and not high depth. It's a testing harness and not maintained.
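
For orientation, here is a rough sketch of what a pggb invocation for this dataset might look like. It only builds and prints the command line, it does not run anything. The flags `-i`, `-n`, `-t` and `-o` are recalled from the pggb README and may behave differently between releases, so please check `pggb --help` for your version. Some releases also expect the input FASTA to be bgzip-compressed and indexed. The count of 21 comes from the `10/21` in the log, the output directory name is made up, and `pggb_command` is a hypothetical helper.

```python
import shlex


def pggb_command(fasta, n_genomes, threads, out_dir):
    """Build (but do not run) a pggb command line as an argument list."""
    return [
        "pggb",
        "-i", fasta,           # input FASTA containing all genomes
        "-n", str(n_genomes),  # number of genomes in the input (assumed meaning)
        "-t", str(threads),    # worker threads
        "-o", out_dir,         # output directory (hypothetical name below)
    ]


# Values taken from the vg msga command and log in the original report.
cmd = pggb_command("20-genomes.fasta", 21, 6, "pggb-out")
print(shlex.join(cmd))
```

Keeping the arguments as a list means the same `cmd` can later be handed to `subprocess.run` without any shell quoting issues.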

ekg commented 4 years ago

Sorry, I realize I told you this twice!

`vg msga` quite slow with 20 1Mb genomes #3092