vgteam / vg

tools for working with genome variation graphs

vg autoindex needs a --buffer-size parameter for vg gbwt #3455

Open brettChapman opened 2 years ago

brettChapman commented 2 years ago

Hi

While running vg autoindex I get complaints about the sequence length being too long. I had the same problem when running vg gbwt, so I set the buffer size to 1000.

Could the same parameter be set with vg autoindex?

I'm running vg version 1.35.

Thanks.

brettChapman commented 2 years ago

Is there a secret menu with autoindex where I can access these hidden parameters, like was mentioned here: https://github.com/vgteam/vg/issues/3303

Otherwise, how would I break down the steps in autoindex? Does autoindex provide a way to perform a dry run by outputting all the commands it will use?

jltsiren commented 2 years ago

vg autoindex does some things that are not easily replicated manually, such as determining the number of parallel GBWT construction jobs based on estimated memory requirements and available CPU cores. Adding a buffer size parameter for GBWT construction (or estimating it from reference path lengths) would be straightforward, but it would break our current memory usage estimates.
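
For reference, the kind of manual GBWT construction run where a buffer size comes into play might look roughly like this. This is a sketch, not a command from this thread: the file names are placeholders, the flag combination is an assumption for a VCF-based build, and I believe --buffer-size is given in millions of nodes.

vg gbwt -x graph.xg -d /path/to/tmp --buffer-size 1000 -o haplotypes.gbwt -v variants.vcf.gz    # hypothetical inputs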

If you want to build indexes for vg giraffe, the first step is building a GBZ file (graph + GBWT). This is usually the hardest part. The exact steps depend on your input and on whether you also want to build other graphs / indexes.

If you have a GFA file of a reasonable size with a reasonable number of haplotypes as P-lines and/or W-lines, you can do this with vg gbwt -g graph.gbz --gbz-format -G graph.gfa. This also determines a good buffer size automatically, chops the GFA segments into at most 1024 bp nodes, and stores a translation table between segment names and (ranges of) node ids in the GBZ file. Things are easiest when the GFA file has the reference genome as P-lines and other haplotypes as W-lines. If the GFA file is too large for a single GBWT construction job, there is no solution at the moment, as we have not seen such GFA files yet.
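
For concreteness, a sketch of that invocation with placeholder file names (the -d scratch directory is optional and mirrors how it is used later in this thread):

vg gbwt -d /path/to/tmp -g graph.gbz --gbz-format -G graph.gfa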

The GBZ graph can be converted into other graph formats with commands such as vg convert -x -Z graph.gbz > graph.xg.

If you built the GBZ graph from GFA with vg gbwt, it now serves as the baseline graph. If you built the graph and the GBWT separately from another input type, those files are your baseline. You should never touch the original input again, as other graphs / indexes built from it may be incompatible with what you already have. All subsequent graphs and indexes should be descendants of the baseline.

Once you have the GBZ file, you can find snarls and build the distance index and the minimizer index with:

vg snarls -T graph.gbz > graph.snarls
vg index -s graph.snarls -j graph.dist graph.gbz
vg minimizer -o graph.min -d graph.dist graph.gbz

Snarl finding can use a lot of memory, while the other commands should have more reasonable requirements. You can reduce the memory usage by splitting the graph into multiple parts, each of them corresponding to one or more graph components. For example:

rm -f graph.snarls
for i in $(seq 1 22; echo X; echo Y); do
    vg snarls -T chr${i}.vg >> graph.snarls
done

Splitting the graph by component should be possible with vg chunk, but I have never used the command myself. You may also get a huge number of files if there are many small components/contigs in the graph.
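
An untested sketch of what that could look like, assuming -C splits the graph into one file per connected component and -b sets the output prefix (worth checking against vg chunk --help before relying on it):

vg chunk -x graph.xg -C -b component    # expected to write one graph file per component with the given prefix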

Things get a bit more complicated if you have hundreds/thousands of haplotypes. You may then want to build a subsampled GBZ with vg gbwt -l and use that GBZ with vg minimizer and vg giraffe (but not in other commands).

brettChapman commented 2 years ago

I tried generating a GBZ graph, generating snarls, and then indexing. However, my job gets killed at the indexing step:

line 25: 1527790 Killed                  singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg index -t 2 -b ${tmp_dir} -s ${SNARLS} -j ${DIST} ${PANGENOME_GBZ}
error[VPKG::load_one]: Could not open barley_pangenome_graph.dist while loading vg::MinimumDistanceIndex

I ran dmesg -T | grep -E -i -B100 'killed process' to see why, and I got this:

[Sat Oct 30 16:47:27 2021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1001.slice/session-36.scope,task=vg,pid=1527810,uid=1001
[Sat Oct 30 16:47:27 2021] Out of memory: Killed process 1527810 (vg) total-vm:2297177832kB, anon-rss:1576888580kB, file-rss:44kB, shmem-rss:0kB, UID:1001 pgtables:3401188kB oom_score_adj:0

It looks like it's requesting 2.2TB of virtual memory, when I only have 1.4TB of RAM and 2GB of swap space. Apart from increasing my swap space to over 1TB, is there another way around indexing which isn't so taxing on memory? Thanks.

brettChapman commented 2 years ago

To provide context: I only have pseudomolecules (no contigs), just chromosomes 1 to 7, and only 20 haplotypes in the graph.

jltsiren commented 2 years ago

It looks like the issue may be with the graph itself. You may want to visualize the graph to see if the overall structure looks reasonable or if there are large tangled subgraphs that would make distance index construction expensive. The PGGB team has spent a lot of effort trying to build useful human pangenome graphs, and they may be able to help you with parameter choices if you try rebuilding the graph.

ekg commented 2 years ago

To reduce the bubble complexity, we can either use a longer segment length in mapping (wfmash -s) or use the pruning tools in vg or odgi to remove complex regions. For humans these are usually centromeric, and easy to find because they have very high depth in the graph.
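
As an illustration of where that knob lives when building the graph with PGGB (the values here are placeholders, not recommendations), the segment length is the -s parameter passed through to wfmash:

pggb -i all_haplotypes.fa.gz -s 100000 -p 95 -n 20 -t 16 -o pggb_out    # -s sets the wfmash segment length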

brettChapman commented 2 years ago

Thanks @ekg

I used 1Mbp as the segment length, so it's already quite a long segment length.

Could I use vg prune like outlined in this old issue: https://github.com/vgteam/vg/issues/1879

Would it be advisable to run it on the entire graph (containing all 7 chromosomes)? I would prefer pruning over running the entire PGGB workflow again. Which vg prune parameters should be used? Should I just go with the defaults?

brettChapman commented 2 years ago

I've increased the swap space on my machine now, so there should be 2.2TB of available virtual memory. I'm trying to index again and see how it goes.

I'm also pruning the graph using this strategy:

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg ids -j -m mapping $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.pg; done)
cp mapping mapping.backup

for i in $(seq 1 7); do
   singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg prune -u -a -m mapping barley_pangenome_graph_${i}H.pg > barley_pangenome_graph_${i}H.pruned.vg
done

I'll compare the performance of the pruned graph vs the non-pruned graph. Generally, from experience, has anyone found pruning to be an essential step? And what is the cost of reducing the complexity of the graph for read alignment and variant calling? My downstream analysis will be genomic read alignment for variant calling and, later, generation of a splice graph for RNA-seq read alignment.

jeizenga commented 2 years ago

In my experience, pruning is completely essential if you want to index with GCSA2 (used by vg map and vg mpmap), or else the exponential worst case can be a real killer. In those indexing pipelines, we only use the pruned graph for the indexing step, after which we discard it. In particular, we align to the full graph during read mapping. The only cost is that the GCSA2 index cannot query matches that cross some edges in complex regions, and I have never found a case where we generated an incorrect mapping because of this limitation.
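
For context, a minimal sketch of that prune-then-index pattern for GCSA2, with hypothetical file names (the pruned graph is only an intermediate for indexing; mapping uses the full graph):

vg prune -r graph.vg > graph.pruned.vg                     # prune complex regions, restoring paths
vg index -g graph.gcsa -b /path/to/tmp graph.pruned.vg     # build the GCSA2/LCP index from the pruned graph
rm graph.pruned.vg                                         # discard it; vg map / vg mpmap align against the full graph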

brettChapman commented 2 years ago

Thanks @jeizenga for your input.

I tried indexing the pruned graph but came up with an error:

I first create an XG of the whole graph from each pruned packed graph:

vg index -x barley_pangenome_graph.pruned.xg -b ${tmp_dir} $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.pruned.pg; done)

Then generate a GFA:

vg convert -t 2 barley_pangenome_graph.pruned.xg -f > barley_pangenome_graph.pruned.gfa 

Then I attempt to create a GBZ, but it gives me an error about there being no paths or walks in my pruned GFA:

vg gbwt -d ${tmp_dir} -g barley_pangenome_graph.pruned.gbz --gbz-format -G barley_pangenome_graph.pruned.gfa
check_gfa_file(): No paths or walks in the GFA file
error: [vg gbwt] GBWT construction from GFA failed
jeizenga commented 2 years ago

I suspect you probably ran pruning without --restore-paths, which can lead it to remove edges that embedded paths take. When it does this, there's no unambiguous way for the graph to express the embedded path, so the path is dropped. The GBZ is really an index of paths, so without some source of paths it can't be constructed. However, one alternative is vg gbwt --path-cover, which tries to synthesize paths using local heuristics.
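
A rough sketch of what a path cover build can look like (hypothetical file names; -n controls how many artificial paths are generated per component):

vg gbwt -x graph.xg -o graph.cover.gbwt -P -n 16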

Is your end goal to use this GBZ for vg giraffe? If so, this might all be beside the point. Pruning is really intended for GCSA2 indexing, which vg giraffe doesn't use.

brettChapman commented 2 years ago

Thanks @jeizenga

Yes, my intent is to use vg giraffe for genomic read alignment and vg mpmap for RNA-seq read alignment.

The main reason I'm now attempting the pruning approach is that the indexing step used a huge amount of RAM. Should I be pruning or not to get around the indexing problem? I've increased the swap space on my machine, so I may be able to index now without pruning, but it may take a lot longer.

jltsiren commented 2 years ago

vg prune is only intended for pruning the graph for GCSA construction. It deliberately drops all paths in the graph, because paths are not needed for kmer generation and maintaining them can be slow and complicated. I'm not sure if odgi prune can split the paths instead of dropping them.

brettChapman commented 2 years ago

@ekg @jltsiren @jeizenga I've tried odgi prune using these parameters:

odgi prune -i barley_pangenome_graph_1H.og -o barley_pangenome_graph_1H.pruned.og -c 3 -C 345 -T

Then converted to GFA, then PG:

odgi view -i barley_pangenome_graph_${i}H.pruned.og -g > barley_pangenome_graph_${i}H.pruned.gfa
vg convert -g barley_pangenome_graph_${i}H.pruned.gfa -p > barley_pangenome_graph_${i}H.pruned.pg

Then tried to index all graphs into one XG file:

vg index -x barley_pangenome_graph.pruned.xg -b ${tmp_dir} $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.pruned.pg; done)

I then tried to produce snarls and then index, but it fails at the indexing stage, saying it can't read in the distance index.

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg convert -t 2 ${PANGENOME_XG} -f > barley_pangenome_graph.pruned.gfa

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg gbwt -d ${tmp_dir} -g ${PANGENOME_GBZ} --gbz-format -G ${PANGENOME_GFA}

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg snarls -t 2 -T ${PANGENOME_GBZ} > ${SNARLS}
singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg index -t 2 -b ${tmp_dir} -s ${SNARLS} -j ${DIST} ${PANGENOME_GBZ}
singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg minimizer -t 2 -o ${MIN} -d ${DIST} ${PANGENOME_GBZ}

Regardless of whether I've pruned the graph or not, I've gotten this same error before. It appears it was killed by OOM after requesting 9TB of virtual RAM.

I came across vg simplify, but it's from an old wiki (https://github.com/vgteam/vg/wiki/Indexing-Huge-Datasets), so I'm not sure if it's something that should be used to radically reduce the complexity of the graph.

How am I supposed to prepare my graph for use with giraffe, apart from increasing my swap space to 10TB, which is an insane amount of swap space? Should I be even more stringent with the odgi prune parameters, perhaps reducing the -C value? Is there a more efficient distance indexing approach? I remember there was mention of another branch of vg using a different approach to distance indexing; it's mentioned here by @xchang1: https://github.com/vgteam/vg/issues/3303

xchang1 commented 2 years ago

You can try building the distance index with this branch: https://github.com/vgteam/vg/tree/for-brett. The master branch won't recognize this distance index, so you'll have to run giraffe from this branch too.

The command for building the distance index is the same, except that it no longer takes the snarl file as input. Instead, -s is the size limit for snarls. I'm guessing your problem is that the snarls are too big, and -s will tell the distance index not to build the whole index for big snarls. The default value is 500, but I haven't tried this on snarls that big so it might need some tuning.
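
To make that concrete, a sketch of the invocation on that branch (file names are placeholders and the -s value is just a starting point to tune):

vg index -t 2 -b /path/to/tmp -j graph.dist -s 500 graph.gbz    # -s caps the snarl size for which the full index is built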

If you also run out of memory with that branch, you can try the one I mentioned earlier: https://github.com/vgteam/vg/tree/distance-big-snarls. It would be better if you could get it working on the for-brett branch, but if it only works with the distance-big-snarls branch then I can get that version working with giraffe too.

Do you know how big your snarls are? You can find out with vg stats -R; the net graph size is the value I'm interested in.
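
A rough way to eyeball the biggest snarls might be something like the following; which tab-separated column holds the net graph size is an assumption here, so check the actual vg stats -R output and adjust the column index:

vg stats -R graph.gbz > snarls.tsv
sort -t$'\t' -k4,4nr snarls.tsv | head    # hypothetical: assumes the net graph size is column 4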

brettChapman commented 2 years ago

Thanks @xchang1

I've been trying to install the 'for-brett' branch, but am struggling to get it built. I've usually installed vg from Docker.

I first clone the branch, cd into vg/, and then build using the Dockerfile from the git repository.

I've been getting this error:

-type -std=c++14 -ggdb -g  -march=nehalem  -fopenmp -msse4.2 -MMD -MP -c -o obj/subcommand/join_main.o src/subcommand/join_main.cpp 
. ./source_me.sh && /usr/bin/g++ -I/vg/include -isystem /vg/include -I. -I/vg/src -I/vg/src/unittest -I/vg/src/subcommand -I/vg/include/dynamic -pthread -isystem /usr/include/cairo -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /usr/include/pixman-1 -isystem /usr/include/uuid -isystem /usr/include/freetype2 -isystem /usr/include/libpng16  -O3 -Werror=return-type -std=c++14 -ggdb -g  -march=nehalem  -fopenmp -msse4.2 -MMD -MP -c -o obj/subcommand/translate_main.o src/subcommand/translate_main.cpp 
. ./source_me.sh && /usr/bin/g++ -I/vg/include -isystem /vg/include -I. -I/vg/src -I/vg/src/unittest -I/vg/src/subcommand -I/vg/include/dynamic -pthread -isystem /usr/include/cairo -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /usr/include/pixman-1 -isystem /usr/include/uuid -isystem /usr/include/freetype2 -isystem /usr/include/libpng16  -O3 -Werror=return-type -std=c++14 -ggdb -g  -march=nehalem  -fopenmp -msse4.2 -MMD -MP -c -o obj/subcommand/giraffe_main.o src/subcommand/giraffe_main.cpp 
src/subcommand/giraffe_main.cpp:37:10: fatal error: valgrind/callgrind.h: No such file or directory
   37 | #include <valgrind/callgrind.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:769: obj/subcommand/giraffe_main.o] Error 1
make: *** Waiting for unfinished jobs....
The command '/bin/sh -c . ./source_me.sh && CXXFLAGS="$(if [ -z "${TARGETARCH}" ] || [ "${TARGETARCH}" = "amd64" ] ; then echo " -march=nehalem "; fi)" make -j $((THREADS < $(nproc) ? THREADS : $(nproc))) objs' returned a non-zero code: 2

I'm running vg stats -R on my pruned XG graph now. I'll get back to you when I know the snarl size. Thanks.

brettChapman commented 2 years ago

vg stats -R has finished running on the pruned XG graph. The largest snarl is 215479 in size.

xchang1 commented 2 years ago

Oof, that's a big snarl. That is definitely causing you problems.

We should have built a docker container for the branch automatically, but it failed some tests and didn't build properly. I'll rerun the build, and then you should be able to find the container here:

https://quay.io/repository/vgteam/vg?tab=tags
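
Once the build appears there, pulling it should look something like this (the tag name is a placeholder until the build actually exists):

docker pull quay.io/vgteam/vg:<tag>
singularity pull vg_for_brett.sif docker://quay.io/vgteam/vg:<tag>    # Singularity equivalent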

brettChapman commented 2 years ago

Thanks @xchang1

I checked the docker repository but couldn't see the build there. Would it have the 'for-brett' tag?

I also ran vg stats -R on the original non-pruned graph, and the largest snarl is 611408.

xchang1 commented 2 years ago

Can you try building it again now? I fixed the compilation problem, but I'm having trouble getting the tests to pass, and I can't get the static binary working either.

brettChapman commented 2 years ago

Sure, I'll try and build from the Dockerfile again. I'll let you know how it goes.

brettChapman commented 2 years ago

The build failed:

/usr/bin/gcc -std=gnu11 -Wall -Wextra -Wsign-compare -Wundef -Wno-format-zero-length -Wpointer-arith -Wno-missing-braces -Wno-missing-field-initializers -pipe -g3 -fvisibility=hidden -Wimplicit-fallthrough -O3 -funroll-loops -I /vg/include -I /vg/include -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/witness.o src/witness.c
/usr/bin/g++ -std=c++14 -Wall -Wextra -g3 -fvisibility=hidden -Wimplicit-fallthrough -O3 -I /vg/include -I/vg/include/dynamic -O3 -Werror=return-type -std=c++14 -ggdb -g -march=nehalem -fopenmp -msse4.2 -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -o src/jemalloc_cpp.o src/jemalloc_cpp.cpp
ar crus lib/libjemalloc_pic.a src/jemalloc.pic.o src/arena.pic.o src/background_thread.pic.o src/base.pic.o src/bin.pic.o src/bin_info.pic.o src/bitmap.pic.o src/buf_writer.pic.o src/cache_bin.pic.o src/ckh.pic.o src/counter.pic.o src/ctl.pic.o src/decay.pic.o src/div.pic.o src/ecache.pic.o src/edata.pic.o src/edata_cache.pic.o src/ehooks.pic.o src/emap.pic.o src/eset.pic.o src/exp_grow.pic.o src/extent.pic.o src/extent_dss.pic.o src/extent_mmap.pic.o src/fxp.pic.o src/hook.pic.o src/hpa.pic.o src/hpa_central.pic.o src/hpdata.pic.o src/inspect.pic.o src/large.pic.o src/log.pic.o src/malloc_io.pic.o src/mutex.pic.o src/mutex_pool.pic.o src/nstime.pic.o src/pa.pic.o src/pa_extra.pic.o src/pac.pic.o src/pages.pic.o src/peak_event.pic.o src/prof.pic.o src/prof_data.pic.o src/prof_log.pic.o src/prof_recent.pic.o src/prof_stats.pic.o src/prof_sys.pic.o src/psset.pic.o src/rtree.pic.o src/safety_check.pic.o src/sc.pic.o src/sec.pic.o src/stats.pic.o src/sz.pic.o src/tcache.pic.o src/test_hooks.pic.o src/thread_event.pic.o src/ticker.pic.o src/tsd.pic.o src/witness.pic.o src/jemalloc_cpp.pic.o
ar: `u' modifier ignored since `D' is the default (see `U')
/usr/bin/gcc -shared -Wl,-soname,libjemalloc.so.2  -o lib/libjemalloc.so.2 src/jemalloc.pic.o src/arena.pic.o src/background_thread.pic.o src/base.pic.o src/bin.pic.o src/bin_info.pic.o src/bitmap.pic.o src/buf_writer.pic.o src/cache_bin.pic.o src/ckh.pic.o src/counter.pic.o src/ctl.pic.o src/decay.pic.o src/div.pic.o src/ecache.pic.o src/edata.pic.o src/edata_cache.pic.o src/ehooks.pic.o src/emap.pic.o src/eset.pic.o src/exp_grow.pic.o src/extent.pic.o src/extent_dss.pic.o src/extent_mmap.pic.o src/fxp.pic.o src/hook.pic.o src/hpa.pic.o src/hpa_central.pic.o src/hpdata.pic.o src/inspect.pic.o src/large.pic.o src/log.pic.o src/malloc_io.pic.o src/mutex.pic.o src/mutex_pool.pic.o src/nstime.pic.o src/pa.pic.o src/pa_extra.pic.o src/pac.pic.o src/pages.pic.o src/peak_event.pic.o src/prof.pic.o src/prof_data.pic.o src/prof_log.pic.o src/prof_recent.pic.o src/prof_stats.pic.o src/prof_sys.pic.o src/psset.pic.o src/rtree.pic.o src/safety_check.pic.o src/sc.pic.o src/sec.pic.o src/stats.pic.o src/sz.pic.o src/tcache.pic.o src/test_hooks.pic.o src/thread_event.pic.o src/ticker.pic.o src/tsd.pic.o src/witness.pic.o src/jemalloc_cpp.pic.o  -lm -lstdc++ -pthread 
ln -sf libjemalloc.so.2 lib/libjemalloc.so
ar crus lib/libjemalloc.a src/jemalloc.o src/arena.o src/background_thread.o src/base.o src/bin.o src/bin_info.o src/bitmap.o src/buf_writer.o src/cache_bin.o src/ckh.o src/counter.o src/ctl.o src/decay.o src/div.o src/ecache.o src/edata.o src/edata_cache.o src/ehooks.o src/emap.o src/eset.o src/exp_grow.o src/extent.o src/extent_dss.o src/extent_mmap.o src/fxp.o src/hook.o src/hpa.o src/hpa_central.o src/hpdata.o src/inspect.o src/large.o src/log.o src/malloc_io.o src/mutex.o src/mutex_pool.o src/nstime.o src/pa.o src/pa_extra.o src/pac.o src/pages.o src/peak_event.o src/prof.o src/prof_data.o src/prof_log.o src/prof_recent.o src/prof_stats.o src/prof_sys.o src/psset.o src/rtree.o src/safety_check.o src/sc.o src/sec.o src/stats.o src/sz.o src/tcache.o src/test_hooks.o src/thread_event.o src/ticker.o src/tsd.o src/witness.o src/jemalloc_cpp.o
ar: `u' modifier ignored since `D' is the default (see `U')
make[1]: Leaving directory '/vg/deps/jemalloc'
Removing intermediate container 28c6291c21e2
 ---> c9e3ab6aa802
Step 21/42 : COPY include /vg/include
COPY failed: file not found in build context or excluded by .dockerignore: stat include: file does not exist
xchang1 commented 2 years ago

Hmm, I'm not very familiar with Docker; I'll ask someone who is. In the meantime, did you clone the branch with --recursive? It looks like the /include directory isn't in the GitHub repo, so maybe you don't need it to compile vg initially and you can just remove that line from the Dockerfile?

brettChapman commented 2 years ago

Yeah, I cloned the branch with --recursive.

I do the following:

git clone --recursive --branch for-brett https://github.com/vgteam/vg.git
cd vg/
docker build -t local/vg .
brettChapman commented 2 years ago

I'll try removing /include and run again

brettChapman commented 2 years ago

It got further along now, but failed a series of tests:

graph: valid
graph: valid
graph: valid
graph: valid
t/53_clip.t ........... 
1..13
ok 1 - clipped graph is valid
ok 2 - every step in clipped graph belongs to reference path
ok 3 - clipped graph has same length as ref path
ok 4 - clipped graph is valid
ok 5 - Just one node filtered
ok 6 - clipped graph is valid
ok 7 - Just one edge filtered
ok 8 - clipped graph is valid
ok 9 - Just one node filtered
ok 10 - clipped graph is valid
ok 11 - clipping bad region changes nothing
ok 12 - clipped graph is valid
ok 13 - Just one node filtered
ok

Test Summary Report
-------------------
t/06_vg_index.t     (Wstat: 0 Tests: 55 Failed: 6)
  Failed tests:  50-55
t/33_vg_mpmap.t     (Wstat: 256 Tests: 19 Failed: 5)
  Failed tests:  15-19
  Non-zero exit status: 1
t/40_vg_gamcompare.t (Wstat: 0 Tests: 7 Failed: 1)
  Failed test:  4
t/46_vg_minimizer.t (Wstat: 0 Tests: 16 Failed: 2)
  Failed tests:  12-13
t/50_vg_giraffe.t   (Wstat: 0 Tests: 27 Failed: 8)
  Failed tests:  1-7, 10
t/52_vg_autoindex.t (Wstat: 0 Tests: 24 Failed: 8)
  Failed tests:  10-12, 14-18
Files=52, Tests=932, 286 wallclock secs ( 0.36 usr  0.06 sys + 521.47 cusr 110.05 csys = 631.94 CPU)
Result: FAIL
make: *** [Makefile:413: test] Error 1
The command '/bin/sh -c /bin/bash -e -c "export OMP_NUM_THREADS=$((THREADS < $(nproc) ? THREADS : $(nproc))); make test"' returned a non-zero code: 2
xchang1 commented 2 years ago

Yeah, I haven't fully integrated some of my changes into vg yet, so it'll fail some tests. Giraffe should work OK; things just aren't in the format the tests expect. Can you run it without the tests?

brettChapman commented 2 years ago

Because the tests fail, the entire Docker build fails too. Is there a way to skip the tests?

brettChapman commented 2 years ago

There's one line with make test in the Dockerfile. I'm skipping it now and rerunning.

brettChapman commented 2 years ago

I managed to get it working now by commenting out the /include and the make test line in the Dockerfile.

I'm running vg index now, setting -s to the maximum snarl size of 215479.

brettChapman commented 2 years ago

vg reported an error while trying to index:

ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Stack trace path: /tmp/vg_crash_BB46yt/stacktrace.txt
Please include the stack trace file in your bug report!
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Stack trace path: /tmp/vg_crash_fsF5KA/stacktrace.txt
Please include the stack trace file in your bug report!

Both those stack traces contained:

bchapman@ubuntu:/media/hhd1/pangenome_snp_calling_latest$ cat /tmp/vg_crash_BB46yt/stacktrace.txt
Crash report for vg not-from-git
Stack trace (most recent call last):
#0    Object "", at 0, in 
bchapman@ubuntu:/media/hhd1/pangenome_snp_calling_latest$ cat /tmp/vg_crash_fsF5KA/stacktrace.txt
Crash report for vg not-from-git
Stack trace (most recent call last):
#0    Object "", at 0, in 
xchang1 commented 2 years ago

There's not a lot of information in that stack trace. It looks a bit like when I tried to run it on a static binary, though. Are you using the Dockerfile or Dockerfile.static? The make command shouldn't have static in it. If that wasn't the problem, someone else suggested cloning vg inside the Dockerfile; maybe that would give you a better stack trace? I'll try running one of my examples in a Dockerfile too, in case our Dockerfile is just missing something.

Can you send your vg index command too please? The -s value should be pretty small; it's not the maximum size of any of your snarls, it's the maximum size of a snarl that the index will store the full version of. Normally, the distance index stores all distances for a snarl, which is quadratic in the size of the snarl. That's way too big for snarls that have hundreds of thousands of nodes so I added the -s option, which stores a subset of the distances for snarls that are bigger than N. The default value of N is 200, but I haven't tuned that parameter at all. Basically a bigger -s value means a bigger but faster index.

brettChapman commented 2 years ago

I used the Dockerfile.

Here's the vg index command I used:

vg index -t 2 -b ${tmp_dir} -s 215479 -j ${DIST} ${PANGENOME_GBZ}

Ok, so I shouldn't be setting -s to the largest snarl? Perhaps I could set it to the second largest snarl size or half the size of the largest snarl?

xchang1 commented 2 years ago

I just made a new docker container that might work. I haven't had a chance to build an actual graph yet, but it runs my unit tests. It's on Docker Hub: xhchang/vg:test

Yeah, definitely not the largest snarl. I'd guess you'll want to keep it in the low thousands. Even half the size of the largest snarl is about 100,000 nodes; squared, that's 10,000,000,000, which would be about 10 GB just to store one snarl, if I did the math right. If you make a histogram of the snarl sizes, you'll probably just have a couple of outliers that are really big, and you can set -s to exclude them.

brettChapman commented 2 years ago

Thanks, I'll try out the new build.

The second largest snarl is around 31605, so it's mainly the largest snarl of 215479 that is the single major outlier. The snarl sizes drop by a few thousand each time after that. I'll try setting -s 10000 and see how it goes.

xchang1 commented 2 years ago

@glennhickey you did some pruning of the HPRC graph to get rid of big snarls, right? Do you have any suggestions? This barley graph has a snarl with 200,000 nodes. I think my abridged distance index will work, but I'm worried that giraffe will be slow with such big snarls

brettChapman commented 2 years ago

After running with -s 10000 over the weekend I get this error:

./run_vg_snp_call.sh: line 34: 808725 Killed                  singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg index -t 2 -b ${tmp_dir} -s 10000 -j ${DIST} ${PANGENOME_GBZ}
terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not load from file barley_pangenome_graph.dist: No such file or directory
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Stack trace path: /tmp/vg_crash_W5SEy4/stacktrace.txt
Please include the stack trace file in your bug report!
bchapman@ubuntu:/media/hhd1/pangenome_snp_calling_latest$ cat /tmp/vg_crash_W5SEy4/stacktrace.txt
Crash report for vg v1.5.0-10234-g37e6d0222
Stack trace (most recent call last):
#11   Object "/vg/bin/vg", at 0x43f989, in _start
#10   Object "/vg/bin/vg", at 0x1b37eb8, in __libc_start_main
#9    Object "/vg/bin/vg", at 0x414dd5, in main
#8    Object "/vg/bin/vg", at 0xb1ad67, in vg::subcommand::Subcommand::operator()(int, char**) const
#7    Object "/vg/bin/vg", at 0xafd12e, in main_minimizer(int, char**)
#6    Object "/vg/bin/vg", at 0x154579b, in handlegraph::TriviallySerializable::deserialize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#5    Object "/vg/bin/vg", at 0x1a80093, in __cxa_throw
#4    Object "/vg/bin/vg", at 0x1a82060, in std::terminate()
#3    Object "/vg/bin/vg", at 0x1a82015, in __cxxabiv1::__terminate(void (*)())
#2    Object "/vg/bin/vg", at 0x1b1a2e4, in __gnu_cxx::__verbose_terminate_handler()
#1    Object "/vg/bin/vg", at 0x1b48780, in abort
#0    Object "/vg/bin/vg", at 0x126ee17, in raise

It was killed by OOM after requesting 6TB of virtual memory. Perhaps I should use an even lower -s value?

xchang1 commented 2 years ago

Yeah, that's what I'd try next

brettChapman commented 2 years ago

I'm now trying an -s value 10x lower, so -s 1000

brettChapman commented 2 years ago

When using -s 1000, my job still gets killed by OOM. This time, instead of the 6TB virtual memory request I saw with -s 10000, it was killed after requesting 11TB, so the memory usage has gone up.

Would increasing -s reduce the memory requirement?

xchang1 commented 2 years ago

It shouldn't. How long is it taking to fail? It's possible that it just happened to hit an 11 TB snarl before it hit the 6 TB snarl and asked for more memory. I'll have to check if it's requesting more memory than it actually needs. Can you share your graph so I can try building it?

brettChapman commented 2 years ago

It took a few days until it failed. That is possible.

I could provide you with the un-pruned GFA files and see how you go with them. It's probably best I provide the original files, as it may be something I'm doing in the steps leading up to indexing the merged graph that is causing the issue.

How can I upload the graphs to you?

ekg commented 2 years ago

These graphs are large but not 11TB big...!

Is this a manifestation of quadratic memory and runtime costs?

brettChapman commented 2 years ago

@ekg Potentially. I did try pruning the graphs back using odgi prune -i barley_pangenome_graph_1H.og -o barley_pangenome_graph_1H.pruned.og -c 3 -C 345 -T, then merging and indexing the final full graph, but I still ran into memory issues as described.

xchang1 commented 2 years ago

@ekg I took out the quadratic component for the really big snarls; it should be linear for any snarl bigger than a given limit. I might have missed something in my code that is still requesting the memory, but I can't find it.

@brettChapman It would be easier for me if you could send me the graph you're building the distance index from. Is it small enough to just send it on GitHub?

brettChapman commented 2 years ago

@xchang1 I'm building the distance index on the entire full merged graph. The GBZ graph I'm trying to build the distance index on is 42GB, the pruned GFA is 42GB, and the unpruned GFA is 144GB, so it's not really feasible to upload it here.

I've also gone back to running my workflow from the start to figure out where the problem is, and now for some reason I cannot run gbwt. It complains about there being no paths in my graph, which is weird, as I'm sure I ran this command successfully before on the pruned graph (using odgi prune -i barley_pangenome_graph.og -o barley_pangenome_graph.pruned.og -c 3 -C 345 -T). I've tried so many different approaches to get it to work that I've probably lost track of what I did along the way. Here's my entire workflow, starting from the ODGI graphs output by PGGB:

for i in $(seq 1 7); do
        singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi view -i barley_pangenome_graph_${i}H.og -g > barley_pangenome_graph_${i}H.gfa
        singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg convert -g barley_pangenome_graph_${i}H.gfa -p > barley_pangenome_graph_${i}H.pg
done

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg index -x ${PANGENOME_XG} -b ${tmp_dir} $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.pg; done)

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg convert -t 2 ${PANGENOME_XG} -f > ${PANGENOME_GFA}

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi build -g ${PANGENOME_GFA} -o barley_pangenome_graph.og

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi prune -i barley_pangenome_graph.og -o barley_pangenome_graph.pruned.og -c 3 -C 345 -T

singularity exec --bind ${PWD}:${PWD} ${PGGB_IMAGE} odgi view -i barley_pangenome_graph.pruned.og -g > barley_pangenome_graph.pruned.gfa

export TMPDIR=${tmp_dir}

singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg gbwt -d ${tmp_dir} -g ${PANGENOME_GBZ} --gbz-format -G barley_pangenome_graph.pruned.gfa
xchang1 commented 2 years ago

We can try this if we can manage to be online at the same time, but we seem to have a pretty unfortunate time difference. Do you think you'll be online at around 9pm EST on Monday? If not, I can try to get hold of a complex graph to recreate the problem.

https://github.com/magic-wormhole/magic-wormhole#readme
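
For reference, the basic magic-wormhole exchange is roughly this (assuming the tool is installed on both machines):

wormhole send barley_pangenome_graph.gbz    # run on the sending machine; prints a one-time code
wormhole receive                            # run on the receiving machine and enter that code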

brettChapman commented 2 years ago

I'll try and be on, but I can't make any promises. I'm on leave from today until the 4th. What's 9pm EST in Perth, Australia?

Which files do you want to work with? The GBZ graph file? I could potentially download that onto my Google drive and share the link. What's your email?

xchang1 commented 2 years ago

I think 10am Tuesday. If not, it can wait until after the holiday. Yeah, just the GBZ. That would be much easier. My email is xhchang@ucsc.edu

brettChapman commented 2 years ago

I just realised the GBZ file is on a server on campus. I can usually log in through a VPN, but it's no longer working for some reason. I'll have to wait until I'm back on campus in the new year, unfortunately.