pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

Pggb with gigabase-sized contigs - out of memory #187

Open agolicz opened 2 years ago

agolicz commented 2 years ago

This issue is related to vgteam/vg#3328. We have assemblies for two genomes, and I tried to build a graph with pggb for one of the chromosomes (about 1.8 Gb in length, mostly repeats). I have a VM with 1 TB of memory, but it looks like I ran out of memory after about 24 hrs. I've been using the newest version of pggb, installed three days ago.

pggb -i input.fa -o output -t 40 -K 52 -p 95 -s 300000 -k 511 -n 2 -G 23117,23219 -v
[wfmash::map] Reference = [input.fa]
[wfmash::map] Query = [input.fa]
[wfmash::map] Kmer size = 52
[wfmash::map] Window size = 256
[wfmash::map] Segment length = 300000 (read split allowed)
[wfmash::map] Block length min = 1500000
[wfmash::map] Chaining gap max = 30000000
[wfmash::map] Percentage identity threshold = 95%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 40
[wfmash::skch::Sketch::build] minimizers picked from reference = 28551821
[wfmash::skch::Sketch::index] unique minimizers = 9646834
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 4333751) ... (14167, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.5%, ignore minimizers occurring >= 36 times during lookup.
[wfmash::map] time spent computing the reference index: 127.962 sec
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 2.02e+07 bp/s elapsed: 00:00:03:01 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 1, reads qualified for mapping = 2, total input reads = 2, total input bp = 3658449192
[wfmash::map] time spent mapping the query: 1.81e+02 sec
[wfmash::map] mapping results saved in: /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::align] Reference = [input.fa]
[wfmash::align] Query = [input.fa]
[wfmash::align] Mapping file = /mnt/agolicz/fababean_graph/wfmash-H8aTUq
[wfmash::align] Alignment identity cutoff = 7.60e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent loading the reference index: 7.86e-03 sec
[wfmash::align::computeAlignments] aligned 57.22% @ 4.45e+04 bp/s elapsed: 00:05:42:33 remain: 00:04:16:04Command terminated by signal 9
wfmash -X -s 300000 -k 52 -p 95 -n 2 -t 40 input.fa input.fa
4545239.25s user 21913.28s system 2375% cpu 192223.15s total 1050026284Kb max memory

Any suggestions of what could be done to reduce memory use?

AndreaGuarracino commented 2 years ago

Such high memory consumption is a bit unexpected, but I see you are requesting (with -s 300k) large homologous regions that probably span big structural variants. To reduce memory usage, you can decrease the segment length, for example to -s 50k or -s 100k. This will lead to easier alignments.

ekg commented 2 years ago

With the current wfmash, even -s 100k is likely to generate very difficult alignments. I wouldn't go above -s 50k.

agolicz commented 2 years ago

Many thanks! I've restarted the run with -s 50k. I'll report back how it goes.

agolicz commented 2 years ago

Just wanted to update on this. It does not look like using -s 50k made any difference to memory usage.

pggb -i input.fa -o output -t 40 -K 52 -p 95 -s 50000 -k 511 -n 2 -G 23117,23219 -v
[wfmash::map] Reference = [input.fa]
[wfmash::map] Query = [input.fa]
[wfmash::map] Kmer size = 52
[wfmash::map] Window size = 256
[wfmash::map] Segment length = 50000 (read split allowed)
[wfmash::map] Block length min = 250000
[wfmash::map] Chaining gap max = 5000000
[wfmash::map] Percentage identity threshold = 95%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /mnt/agolicz/fababean_graph/wfmash-MUw5oL
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads = 40
[wfmash::skch::Sketch::build] minimizers picked from reference = 28551821
[wfmash::skch::Sketch::index] unique minimizers = 9646834
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 4333751) ... (14167, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.5%, ignore minimizers occurring >= 36 times during lookup.
[wfmash::map] time spent computing the reference index: 144.249 sec
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 1.72e+07 bp/s elapsed: 00:00:03:32 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 1, reads qualified for mapping = 2, total input reads = 2, total input bp = 3658449192
[wfmash::map] time spent mapping the query: 2.13e+02 sec
[wfmash::map] mapping results saved in: /mnt/agolicz/fababean_graph/wfmash-MUw5oL
[wfmash::align] Reference = [input.fa]
[wfmash::align] Query = [input.fa]
[wfmash::align] Mapping file = /mnt/agolicz/fababean_graph/wfmash-MUw5oL
[wfmash::align] Alignment identity cutoff = 7.60e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent loading the reference index: 2.28e-02 sec
[wfmash::align::computeAlignments] aligned 0.31% @ 7.45e+01 bp/s elapsed: 00:13:14:03 remain: 177:14:55:19Command terminated by signal 9
wfmash -X -s 50000 -k 52 -p 95 -n 2 -t 40 input.fa input.fa
2908001.88s user 35337.69s system 3991% cpu 73734.58s total 1050005520Kb max memory

I am technically on leave right now, but when I'm back next week I'll try to split the chromosomes into smaller chunks using gene synteny information to see if that helps. The estimated runtime also looks very high (177:14:55:19); maybe it's something about the repeats that is causing the issue...

ekg commented 2 years ago

How many genomes do you have? wfmash -n 2 will over-map in the case of two genomes, which could result in a hard alignment.

I would push -p up and -s down. Please try -s 10k -p 98.

ekg commented 2 years ago

Would it be possible to share the input genomes so we can test?

agolicz commented 2 years ago

I tried -n 1 -s 10k -p 98, but still ran into memory issues.

 pggb -i input.fa -o output -t 40 -K 52 -p 98 -s 10k -k 511 -n 1 -G 23117,23219 -v
[wfmash::map] Reference = [input.fa]
[wfmash::map] Query = [input.fa]
[wfmash::map] Kmer size = 52
[wfmash::map] Window size = 256
[wfmash::map] Segment length = 10000 (read split allowed)
[wfmash::map] Block length min = 50000
[wfmash::map] Chaining gap max = 1000000
[wfmash::map] Percentage identity threshold = 98%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /mnt/agolicz/fababean_graph/wfmash-XRru4j
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 40
[wfmash::skch::Sketch::build] minimizers picked from reference = 28551821
[wfmash::skch::Sketch::index] unique minimizers = 9646834
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 4333751) ... (14167, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.5%, ignore minimizers occurring >= 36 times during lookup.
[wfmash::map] time spent computing the reference index: 137.703 sec
[wfmash::skch::Map::mapQuery] mapped 100.00% @ 2.18e+07 bp/s elapsed: 00:00:02:47 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 1, reads qualified for mapping = 2, total input reads = 2, total input bp = 3658449192
[wfmash::map] time spent mapping the query: 1.68e+02 sec
[wfmash::map] mapping results saved in: /mnt/agolicz/fababean_graph/wfmash-XRru4j
[wfmash::align] Reference = [input.fa]
[wfmash::align] Query = [input.fa]
[wfmash::align] Mapping file = /mnt/agolicz/fababean_graph/wfmash-XRru4j
[wfmash::align] Alignment identity cutoff = 7.84e-01%
[wfmash::align] Alignment output file = /dev/stdout
[wfmash::align] time spent loading the reference index: 2.38e-02 sec
[wfmash::align::computeAlignments] aligned  1.60% @ 2.02e+02 bp/s elapsed: 00:10:31:17 remain: 26:21:26:31Command terminated by signal 9
wfmash -X -s 10k -k 52 -p 98 -n 1 -t 40 input.fa input.fa
1852863.33s user 27801.48s system 3981% cpu 47235.01s total 1049953068Kb max memory

One of the assemblies can be downloaded here: https://projects.au.dk/fabagenome/genomics-data

I emailed a download link to the other one to the UoT email address. Please let me know if there are any issues with accessing data.

ekg commented 2 years ago

It looks like we solved this problem. It was a code-level bug that only occurred with very large chromosomes. You can test by getting the current development version of wfmash, but we will probably release soon.
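A minimal sketch of building the development version from source (repository URL and cmake steps assumed from the wfmash README; adjust paths and job count to your environment):

git clone --recursive https://github.com/waveygang/wfmash
cd wfmash
cmake -H. -Bbuild && cmake --build build -- -j 16
# the binary should land under build/bin/; put it on PATH ahead of any released version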

agolicz commented 2 years ago

Many thanks for looking into this and the fast response! I got the dev version and did a run with one of the chromosomes with these parameters: pggb -i input.fa -o output -t 40 -K 52 -p 98 -s 50k -k 511 -n 1 -G 23117,23219 -v

Seems to have completed fine. Lightning fast too! I'll have a proper look at the graph when I'm back in the office next week.

Is -n 1 sufficient for two genomes, also to detect duplications?

ekg commented 2 years ago

In the current master of pggb (and going forward) you need to specify -n 2 for two genomes. Sorry for the confusion!

ekg commented 2 years ago

Are you using the current master of pggb?

If so, you'll need to set -n 2 for this problem.

Curious if it works for you. Did you run into a problem with the result?

ekg commented 2 years ago

Also, -K 52 -k 511 is unlikely to work correctly. I would suggest staying close to defaults with: pggb -s 100k -p 98 -n 2

agolicz commented 2 years ago

Yes, I'm using the master, I started the run with pggb -s 100k -p 98 -n 2. Will report back how it goes. Trying on a single chromosome for now.

agolicz commented 2 years ago

I started playing around with -K when I got the out-of-memory errors, but now I'm back to defaults.

agolicz commented 2 years ago

Just finished a run for the two genomes and all chromosomes at the same time (pggb -i input.fa -o output -t 40 -p 98 -s 100k -n 2 -G 23117,23219 -v). It took ~24 hrs, with max memory consumption of 138 Gb. Just having a quick look at the fix.gfa:

cut -f 1 input.fa.48fecf0.7bdde5a.bc2f2cb.smooth.fix.gfa | sort | uniq -c
      1 H
13364598 L
   5957 P
9873486 S

So I think it went OK. I now need to have a proper look at the GFA and check some known variants, but I will probably need some time before I can get to it.
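(For reference: in GFA, H is the header, S lines are segments/nodes, L lines are links/edges, and P lines are paths, so the S and L counts above correspond to the node and edge totals that vg stats reports further down.)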

agolicz commented 2 years ago

In the end we will have 10 HiFi assemblies and 20*30x Illumina for genotyping, so it should be a fun project. We will have a PhD student focusing on this work, but I'll also try to help out as much as I can. I've been looking forward to doing some pangenome graph work for a while now and am super excited about the newest developments!

ekg commented 2 years ago

Great! Glad it worked to completion. You can get information about known variants from vg deconstruct (VCF) or through odgi extract and targeted analysis.

You may want to apply vcfbub and vcflib vcfallelicprimitives to decompose big variants. We'll be incorporating this into pggb in the coming weeks.
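A rough sketch of that post-processing on the graph above (file names are placeholders, REF stands for the path prefix of whichever assembly you treat as the reference, and the exact flags may differ between tool versions, so check each tool's --help):

vg deconstruct -P REF -e -t 40 input.fa.48fecf0.7bdde5a.bc2f2cb.smooth.fix.gfa > raw.vcf
vcfbub -l 0 -a 100000 --input raw.vcf > popped.vcf      # drop nested records and very long alleles
vcfallelicprimitives -kg popped.vcf > decomposed.vcf    # break complex ALTs into SNPs/indels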

agolicz commented 2 years ago

I just had a look at some more stats and the length of the graph worries me a bit:

vg stats -zlAr input.fa.48fecf0.7bdde5a.bc2f2cb.smooth.fix.gfa

nodes   9873486
edges   13364598
length  22218858029
node-id-range   1:9873486
cyclic

The individual assemblies are 11-12 Gb (input.fa length 23281212711), and since it's a plant I would expect maybe 10-20% extra sequence between accessions, so the graph seems too long... (that's for -p 98; I am trying -p 95 but it's still running).
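Back-of-the-envelope, using the numbers above: two well-aligned ~11.5 Gb haplotypes with 10-20% accession-specific sequence should give a graph of roughly 11.5 Gb x 1.1-1.2 ≈ 12.7-13.8 Gb, whereas 22.2 Gb is close to the 23.3 Gb sum of the inputs, i.e. most of the shared sequence has not been collapsed.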

ekg commented 2 years ago

It looks very underaligned to me, especially based on the wfmash tests we did.

What command line are you using? Did you put all chromosomes together (that should work here)?

Can you confirm that the alignment is working for a single chromosome pair? Just plotting a PAF should make it clear what's happening.
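For example, on the PAF that wfmash wrote for the chromosome pair (the file name here is a placeholder; pafplot is the same tool used for the plots later in this thread, and as far as I know it writes a PNG next to the input):

pafplot -s 2000 chr1L_pair.paf   # should produce chr1L_pair.paf.png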

agolicz commented 2 years ago

Yes, you are right; just looking at the size of the PAF files. No errors in the log (that I could see...). The command was: pggb -i input.fa -o output -t 40 -p 98 -s 100k -n 2 -G 23117,23219 -v. Yes, I put all the chromosomes together, but I also have a test on one chromosome (1L, plot attached).

If you saw good wfmash alignments, that's good news. Could you let me know what parameters you were using?

[attached image: chr1L alignment dotplot]

ekg commented 2 years ago

The plot gives me the impression that there is a lot of dropout in the alignment. It might be happening at the mapping step. It's worth noting that -p 98 -s 100k in the current wfmash version is more stringent than what we used for humans, which have ~1/1000 pairwise diversity.

There is also the possibility of failure in the alignment. I have these genomes for testing; I'll see what works in terms of settings.

If you want to check things out in parallel, what happens with the default wfmash parameters and this particular chromosome pair?

agolicz commented 2 years ago

I'll also keep trying on my end. I was waiting for a second run with all chromosomes to finish (pggb -i input.fa -o output -t 40 -p 95 -s 100k -n 2 -G 23117,23219 -v), but I think wfmash got stuck at 98% aligned; the PAF file stopped being updated on 01.05 at 6 am, and I just killed it now, two days later.

Now I am trying one chromosome on defaults.

agolicz commented 2 years ago

The parameters for the default run on one chromosome: pggb -i input.fa -o outputdef -t 40 -n 2 -v

ekg commented 2 years ago

I suspect that -s 100k is just too large. I haven't been testing that length with the current version.

If the segment is too big, it makes the alignment problem harder in both time and memory. Shorter segments help pin down the ends of the mappings and reduce the alignment score (which in WFA is roughly what the cost scales with).
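(Roughly speaking, the basic wavefront algorithm runs in O(ns) time and O(s²) memory, where s is the alignment score, so a single long segment spanning a large SV can cost far more than many short segments covering the same region.)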

ekg commented 2 years ago

I'm able to reproduce the slowness with the last few alignments. However, I've only been running the alignment for about an hour. It seems possible they will complete soon.

i=1.a; sbatch -c 48 -q workers --wrap '/gnu/store/h4i305b8fam9g0iy4vnjapnwdzgp4lw8-wfmash-0.8.1+bc4b32c-4/bin/wfmash -t 48 211010_Vfaba_split_pseudomolecules_hedin_v1pluschlro_clean_unanchored_contigs.fasta.gz ragtag.scaffold.fmt.fasta.gz >vfaba.'$i'.paf'

This is at about 99.8% complete but slowly processing the last alignments. Here's a pafplot -s 2000 of the output so far. It looks relatively complete to me.

[image: pafplot of the wfmash output so far]

Some hypotheses: some difficult alignments are simply taking a long time. With -s 100k the gaps that can be spanned by a mapping are extremely large (10 Mbp), and this may be leading to a failure in the alignment. However, we're seeing stragglers even with the default settings -p 98 -s 10k, where a 1 Mbp chaining gap is allowed.

There are several potential solutions. One is to set a maximum score on the alignment process, to prevent it from running endlessly when the pair of sequences doesn't have good homology. Another is to tighten the filter on the allowed length divergence within a mapping, to be less permissive of these mappings. I'll find the alignment pairs that are slow and check this out.

But, to avoid getting stuck, an easy solution may be to reduce these chaining gap thresholds. They can be adjusted; for instance, wfmash -c 100k sets the chaining gap to 100k. You can run a custom alignment setup and feed it into pggb with pggb -i. The rest of the process should proceed as before.
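For example, reusing the mapping parameters from the logs above and only tightening the chaining gap (a sketch; the output file name is arbitrary):

wfmash -X -t 40 -s 10k -p 95 -c 100k input.fa input.fa > input.fa.self.paf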

ekg commented 2 years ago

These notes may be useful later, or for other users who run into this problem, but the process finished about 30 minutes after I wrote them!

[image: pafplot of the completed alignment]

Would you try running pggb with default parameters here? wfmash -p 95 -s 10k is default, and the same parameters are exposed on pggb's command line (and are default there).

ekg commented 2 years ago

Excuse me! While this alignment works quickly enough, the rendering is masking an underalignment issue!

-> % f=vfaba.1.a.paf ; <$f | sed s/gi:f:// | awk '{ sum += $13 * $10; tot += $10; block += $11; } END { print "matches",tot; print "block",block; print "gap.id", sum/tot; print "block.id",tot/block * 100; }' 
matches 3227346324
block 3615773214
gap.id 99.1093
block.id 89.2574
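(In the one-liner above, PAF column 10 is the number of residue matches and column 11 the alignment block length; the sed strips the gi:f: tag, the gap-compressed identity judging by the gap.id label, so awk can read it as a plain number. block.id is therefore total matches as a percentage of total aligned block length.)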

This is only ~1/3 of the 11G genome.

Investigating further. Which chromosome do you suggest working with?

agolicz commented 2 years ago

I suggest working with 1L (that's the one I am using).

agolicz commented 2 years ago

My run on defaults (wfmash -p 95 -s 10k) finished today and resulted in the same outcome: under-mapping and a graph almost twice the size (input.fa = 3658449192).

vg stats -zlAr input.fa.6771046.7bdde5a.6abc6ee.smooth.fix.gfa

nodes   15021423
edges   20215250
length  3050728404
node-id-range   1:15021423
cyclic

ekg commented 2 years ago

Just checking, but 3,658,449,192 vs. 3,050,728,404 isn't 2x as large.

agolicz commented 2 years ago

Sorry, I should have been clearer. One chromosome is ~1.8 Gb and the two chromosomes together (input.fa) are 3658449192 bp, so the graph (3050728404 bp) is still close to 2 x 1.8 Gb; at least closer to that than it should be, consistent with a 20-30% alignment rate (I think!).

ekg commented 2 years ago

Oh right! Sorry for my confusion!

ekg commented 2 years ago

I would suggest using a shorter wfmash segment length. This will significantly reduce the complexity of the alignment. The current default (-s 10k) works well even in very large genomes. You may decide to increase this, but 300k seems like it might start to cause problems.

Also, very recently (not sure if it's released yet, but within the last few days) we fixed a bug related to the -n parameter that caused over-mapping, with the worst effects at n=2, resulting in some extremely difficult alignments. What git commit are you using?

agolicz commented 2 years ago

I have to try again. Master from 15/06 was completing fine on defaults (-n 2 -v -t 40); it took a couple of days with a ~60% alignment rate. Master from 20/07 was not completing on defaults (I had to kill the run; it was stuck on the wfmash alignment). I got busy afterwards, but I will try the newest version and see what happens.

agolicz commented 2 years ago

Unfortunately wfmash master from yesterday is also stuck (for about 12 hrs now).

pggb -i input.fa -n2 -v -t40 --output-dir pggbt.11102022

[wfmash::map] Reference = [input.fa]
[wfmash::map] Query = [input.fa]
[wfmash::map] Kmer size = 19
[wfmash::map] Window size = 136
[wfmash::map] Segment length = 5000 (read split allowed)
[wfmash::map] Block length min = 25000
[wfmash::map] Chaining gap max = 100000
[wfmash::map] Percentage identity threshold = 90%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = pggbt.11102022/wfmash-MfOViv
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads  = 40
[wfmash::skch::Sketch::build] minimizers picked from reference = 54195372
[wfmash::skch::Sketch::index] unique minimizers = 7612150
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 3298728) ... (94005, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minimizers occurring >= 13521 times during lookup.
[wfmash::map] time spent computing the reference index: 99.0164 sec
[wfmash::skch::Map::mapQuery] mapped 50.00% @ 9.63e+07 bp/s elapsed: 00:00:00:19 remain: 00:00:00:19

Any chance you could have a look at what might be going on? You might still have the faba test data (chr1L). Also, I noticed wfmash never seems to use more than ~200% CPU; is that normal?

Agnieszka

AndreaGuarracino commented 2 years ago

Recently, we merged a few improvements in wfmash to better manage higher divergence. That could improve your alignment time. If you can work from GitHub, I would suggest updating your wfmash installation to the current master branch.

However, looking at your log, wfmash is stuck at the mapping phase, i.e. before the alignment. With small segment sizes (necessary for good sensitivity) and low identity thresholds, the mapping time increases. We don't have ready alternatives yet, but if you want to try, you can add -w 512 or -w 1024 to the wfmash call in pggb (https://github.com/pangenome/pggb/blob/master/pggb#L406-L416). This parameter controls how much information is sampled from the sequences for the mapping: higher -w values mean less information, which lowers the mapping runtime but also makes the mappings a bit less accurate ("better to have something less accurate than nothing").
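One way to gauge the effect is to run wfmash directly with the larger -w and watch how quickly the [wfmash::map] phase completes (a sketch reusing the defaults from the log above; the output name is arbitrary):

wfmash -X -t 40 -s 5000 -p 90 -w 1024 input.fa input.fa > input.fa.self.paf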