pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
196 stars 40 forks source link

prevent quadratic untangle cut point explosions #474

Open ekg opened 1 year ago

ekg commented 1 year ago

That's the premise of this PR. It seems that untangling isn't behaving as expected in deep loops, and this branch will have hacks required to figure out why. It may be due to a quadratic expansion of the number of possible cut points.

The current patch makes a change that does affect the untangle cut points. We need to validate that the output makes sense.

ekg commented 1 year ago

Are we ready to merge? I guess we should check larger scale tests.

AndreaGuarracino commented 1 year ago

In terms of performance, it is good. I tried it on multiple chrXY graphs of primates. The slowest run took ~6 hours.

octopus11
[odgi::untangle] warning: no step index specified. Building one with a sample rate of 8. This may take additional time. A step index can be provided via -a, --step-index. A step index can be built using odgi stepindex.
[odgi::algorithms::stepindex] Collecting Steps Progress: 100.00% @ 3.97e+00/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::algorithms::stepindex] Building Progress: 100.00% @ 2.89e+00/s elapsed: 00:00:00:04 remain: 00:00:00:00
[odgi::algorithms::untangle] untangling 14 queries with 1 targets
[odgi::algorithms::untangle] establishing initial cuts for 14 paths
[odgi::algorithms::untangle] set target nodes 100.00% @ 5.29e-01/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 1.42e-03/s elapsed: 00:00:11:44 remain: 00:00:00:00
[odgi::algorithms::untangle] mark nodes every 50000 bp 100.00% @ 3.17e+07/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::algorithms::untangle] add new cut points 100.00% @ 4.08e+00/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::algorithms::untangle] building target segment index
[odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 6.20e-04/s elapsed: 00:00:26:53 remain: 00:00:00:00
[odgi::algorithms::untangle] prepare segment cuts 100.00% @ 4.38e-01/s elapsed: 00:00:00:02 remain: 00:00:00:00
[odgi::algorithms::untangle] make the mapping 100.00% @ 3.73e+07/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::algorithms::untangle] untangling 14 queries 100.00% @ 7.26e-04/s elapsed: 00:05:21:32 remain: 00:00:00:00
    Command being timed: "/home/guarracino/tools/odgi/bin/odgi-547e4d76f340dc0da34bae1522308e9b97efd0c9 untangle -i /lizardfs/guarracino/tree_of_life_alignment/graphs/primates14.chrXY.p70.s5000/primates14.chrXY.fa.gz.174e029.417fcdf.a903a91.smooth.final.og -r mGorGor1#chrY -e 50000 -m 1000 -j 0 -n 28 -t 14 -P"
    User time (seconds): 75629.28
    System time (seconds): 140.30
    Percent of CPU this job got: 349%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 6:01:10
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 25976056
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 42828892
    Voluntary context switches: 41562533
    Involuntary context switches: 82776
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
ekg commented 1 year ago

I wonder if we'd go faster without the sampled step index.

On Thu, Feb 9, 2023 at 2:08 PM Andrea Guarracino @.***> wrote:

In terms of performance, it is good. I tried it on multiple chrXY graphs of primates. The slowest run took ~6 hours.

octopus11 [odgi::untangle] warning: no step index specified. Building one with a sample rate of 8. This may take additional time. A step index can be provided via -a, --step-index. A step index can be built using odgi stepindex. [odgi::algorithms::stepindex] Collecting Steps Progress: 100.00% @ 3.97e+00/s elapsed: 00:00:00:03 remain: 00:00:00:00 [odgi::algorithms::stepindex] Building Progress: 100.00% @ 2.89e+00/s elapsed: 00:00:00:04 remain: 00:00:00:00 [odgi::algorithms::untangle] untangling 14 queries with 1 targets [odgi::algorithms::untangle] establishing initial cuts for 14 paths [odgi::algorithms::untangle] set target nodes 100.00% @ 5.29e-01/s elapsed: 00:00:00:01 remain: 00:00:00:00 [odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 1.42e-03/s elapsed: 00:00:11:44 remain: 00:00:00:00 [odgi::algorithms::untangle] mark nodes every 50000 bp 100.00% @ 3.17e+07/s elapsed: 00:00:00:01 remain: 00:00:00:00 [odgi::algorithms::untangle] add new cut points 100.00% @ 4.08e+00/s elapsed: 00:00:00:03 remain: 00:00:00:00 [odgi::algorithms::untangle] building target segment index [odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 6.20e-04/s elapsed: 00:00:26:53 remain: 00:00:00:00 [odgi::algorithms::untangle] prepare segment cuts 100.00% @ 4.38e-01/s elapsed: 00:00:00:02 remain: 00:00:00:00 [odgi::algorithms::untangle] make the mapping 100.00% @ 3.73e+07/s elapsed: 00:00:00:00 remain: 00:00:00:00 [odgi::algorithms::untangle] untangling 14 queries 100.00% @ 7.26e-04/s elapsed: 00:05:21:32 remain: 00:00:00:00 Command being timed: "/home/guarracino/tools/odgi/bin/odgi-547e4d76f340dc0da34bae1522308e9b97efd0c9 untangle -i /lizardfs/guarracino/tree_of_life_alignment/graphs/primates14.chrXY.p70.s5000/primates14.chrXY.fa.gz.174e029.417fcdf.a903a91.smooth.final.og -r mGorGor1#chrY -e 50000 -m 1000 -j 0 -n 28 -t 14 -P" User time (seconds): 75629.28 System time (seconds): 140.30 Percent of CPU this job got: 349% Elapsed (wall clock) time (h:mm:ss or m:ss): 6:01:10 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 25976056 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 42828892 Voluntary context switches: 41562533 Involuntary context switches: 82776 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

— Reply to this email directly, view it on GitHub https://github.com/pangenome/odgi/pull/474#issuecomment-1424746622, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOC2XUCQ72DFCKGVGDWWVFE7ANCNFSM6AAAAAAUTG4ITM . You are receiving this because you authored the thread.Message ID: @.***>

AndreaGuarracino commented 1 year ago

There is still a problem with the reverse orientation. This is from the injection tutorial, but without the odgi flip step:

image master on the left, this branch on the right

The tracks in reverse are split into more arrows. Applying odgi flip, the output becomes identical.

AndreaGuarracino commented 1 year ago

Actually, there are also a few problems in forward.

AndreaGuarracino commented 1 year ago

A bit better, but still more arrows than the master (on the left):

image

AndreaGuarracino commented 1 year ago

And also different best targets (different colors).

AndreaGuarracino commented 1 year ago

Almost 20 hours. Not 10 minutes as before the last fixing commit, but neither days of waiting as with the master.

octopus03
[odgi::untangle] warning: no step index specified. Building one with a sample rate of 8. This may take additional time. A step index can be provided via -a, --step-index. A step index can be built using odgi stepindex.
[odgi::algorithms::stepindex] Collecting Steps Progress: 100.00% @ 4.59e+00/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::algorithms::stepindex] Building Progress: 100.00% @ 3.41e+00/s elapsed: 00:00:00:04 remain: 00:00:00:00
[odgi::algorithms::untangle] untangling 14 queries with 1 targets
[odgi::algorithms::untangle] establishing initial cuts for 14 paths
[odgi::algorithms::untangle] set target nodes 100.00% @ 4.79e-01/s elapsed: 00:00:00:02 remain: 00:00:00:00
[odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 2.75e-05/s elapsed: 00:10:06:48 remain: 00:00:00:00
[odgi::algorithms::untangle] mark nodes every 50000 bp 100.00% @ 2.58e+07/s elapsed: 00:00:00:01 remain: 00:00:00:00
[odgi::algorithms::untangle] add new cut points 100.00% @ 6.22e+00/s elapsed: 00:00:00:02 remain: 00:00:00:00
[odgi::algorithms::untangle] building target segment index
[odgi::algorithms::untangle] untangle and merge cuts 100.00% @ 1.02e-03/s elapsed: 00:00:16:16 remain: 00:00:00:00
[odgi::algorithms::untangle] prepare segment cuts 100.00% @ 4.79e-01/s elapsed: 00:00:00:02 remain: 00:00:00:00
[odgi::algorithms::untangle] make the mapping 100.00% @ 2.44e+07/s elapsed: 00:00:00:00 remain: 00:00:00:00
[odgi::algorithms::untangle] untangling 14 queries 100.00% @ 4.42e-04/s elapsed: 00:08:48:01 remain: 00:00:00:00
    Command being timed: "/home/guarracino/tools/odgi/bin/odgi-2c78159a1b4bf122493075e436ea9c53033f430f untangle -i /lizardfs/guarracino/tree_of_life_alignment/graphs/primates14.chrXY.p90.s20000/primates14.chrXY.fa.gz.e4728fd.417fcdf.8abe853.smooth.final.og -r chm13v2#chrX -e 50000 -m 1000 -j 0 -n 28 -t 14 -P"
    User time (seconds): 349928.09
    System time (seconds): 282.15
    Percent of CPU this job got: 506%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 19:11:58
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 22090900
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 328758411
    Voluntary context switches: 82374953
    Involuntary context switches: 544237
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
ekg commented 1 year ago

Yup this is still not done.

ekg commented 1 year ago

Screenshot from 2023-02-15 16-57-38

This is looking more correct. I'm not seeing the same problems. Looking to see if there are further wins to be made.