Issues with generating GCSA2 index

RenzoTale88 commented 4 years ago

Hello, I'm contacting you because I'm having troubles in generating the graph index. I have generated chromosome-wise multiple alignments between 5 different assemblies using cactus, converted them to vg using hal2vg, and processed them as described in here https://github.com/vgteam/sv-genotyping-paper/issues/6#issue-502964225.

Everything went fine until I had to generate the GCSA index for the genome. I've proceeded as follow:

export TMPDIR=`pwd`

for i in seq(1 N); do
    hal2vg --noAncestors --refGenome myref chr${i}.hal > tmp${i}.vg
    vg mod -X 32 tmp${i}.vg > chr${i}.final.vg
done

# Generate IDs
vg ids -j $(for i in $(seq 1 N); do echo chr${i}.final.vg; done)

# Generate XG index
vg index -x all.xg $(for i in seq(1 N); do echo chr${i}.final.vg ; done)

# Prune 
for i in seq(1 N); do
    vg prune -r chr${i}.final.vg > chr${i}.final.pruned.vg
done

# Create GCSA2
vg index -Z 4096 -t 32 -g all.gcsa $(for i in $(seq 1 N); do echo chr${i}.final.pruned.vg; done)

For the final step I'm running it on a cluster assigning 32 cores and 1Tb of memory. The process have a maximum memory usage of 780Gb of ram, after which if fails of processing the data with the error reported below:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.

Attached, you can find the stderr I'm getting.

Thank you for your support

Andrea

LOG.txt

jltsiren commented 4 years ago

I think the issue is with the vg prune command. If the graph is based on multiple alignment of sequences, and the sequences are embedded as paths, option -r will undo the pruning, because it restores all nodes and edges used by the paths.

You should be able to use the "many paths" approach described in the wiki.

Because you have already run vg ids -j, you should not rerun it. Instead, you just generate an empty node mapping file.

vg ids -m mapping $(for i in $(seq 1 N); do echo chr${i}.final.vg; done)
cp mapping mapping.backup

Then you prune the graph using the -u option, which unfolds the paths in pruned regions. The node mapping file is used to keep track of duplicate nodes.

cp mapping.backup mapping
for i in seq(1 N); do
  vg prune -u -a -m mapping chr${i}.final.vg > chr${i}.final.pruned.vg
done

Then in the vg index command, you use the node mapping to make the index map the duplicate nodes to their original ids.

vg index -Z 4096 -t 32 -g all.gcsa -f mapping $(for i in $(seq 1 N); do echo chr${i}.final.pruned.vg; done)

RenzoTale88 commented 4 years ago

Thank you for your suggestions, I'm trying that solution now. I'll report back to this once it has run and finished.

Andrea

RenzoTale88 commented 4 years ago

I'm running the command to generate the GCSA index as above recommended. I have a question concerning the parallelization of the job. Accoridng to the wiki page, the process is multithreaded through OpenMP https://github.com/vgteam/vg/wiki/Index-Construction. However, when I htop the node with the index generation, I can see that few threads are waiting for data, and most of them are suspended. Is it related to the use of the duplicate node mapping? Or is it that just some parts of the generation of the index are threaded, and most of the run time involves reading the input data?

Sorry for the probably simple question, but it might help me in better scheduling my jobs in the future.

Andrea

jltsiren commented 4 years ago

There are many interleaved phases in GCSA2 construction. Some are fully parallel, while others are fully sequential. Some are parallel, but CPU utilization depends heavily on disk speed. There are also some phases that are mostly sequential but have a separate thread assigned for reading each temporary file.

RenzoTale88 commented 4 years ago

Hi, sorry for the delay. I've managed to generate the indexes as you recommended. This is the list of steps I've run to generate the indexes starting from the HAL archive:

export TMPDIR=`pwd`

for i in seq(1 N); do
    hal2vg --noAncestors --refGenome myref chr${i}.hal > tmp${i}.vg
    vg mod -X 32 tmp${i}.vg > chr${i}.final.vg
done

# Fix IDs and generate empty map of nodes
vg ids -j $(for i in $(seq 1 N); do echo chr${i}.final.vg; done)
vg ids -m mapping $(for i in $(seq 1 N); do echo chr${i}.final.vg; done)

# Generate XG index
vg index -x all.xg $(for i in seq(1 N); do echo chr${i}.final.vg ; done)

# Prune unfolding nodes
for i in seq(1 N); do
    vg prune -t $NSLOTS -u -a -m mapping chr${i}.final.vg > chr${i}.final.pruned.vg
done

# Create GCSA2
vg index -p -Z 4096 -t 12 -f mapping -g all.gcsa $(for i in $(seq 1 N); do echo chr${i}.final.pruned.vg; done)

Could you please confirm me that I'm proceeding in the correct way? Does the addition of the mapping imply some changes to downstream analyses (i.e. mapping, augmenting, chunking etc.)?

Thank you again for your help Andrea

jltsiren commented 4 years ago

You can combine the two vg ids commands to save some time:

vg ids -j -m mapping $(for i in $(seq 1 N); do echo chr${i}.final.vg; done)

Running the -m command separately is only necessary if you have already joined the node id spaces with -j.

Otherwise everything looks correct.

The node mapping is necessary to make the GCSA2 index work correctly after pruning with option -u. Without it, the index would map kmers to positions that do not exist in the graph.

RenzoTale88 commented 4 years ago

Ok after some long testing I've managed to generate the indexes so I'm just going to close this issue. Thank again for the support.

vgteam / vg

Issues with generating GCSA2 index #2514