vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Giraffe warning when running on cactus pangenome graph #3550

Closed RenzoTale88 closed 2 years ago

RenzoTale88 commented 2 years ago

Hello, I've generated a graph genome of five small assemblies (~220-250 Mb genome size) using the cactus pangenome workflow. The workflow completed successfully and generated gg, gbwt and xg graph files, which I then used to build the minimizer and distance indexes. When I try to run giraffe on the graph I get the following warning:

./vg giraffe -H GRAPH/final_uclip/pangenome_5asm.gbwt -g GRAPH/final_uclip/pangenome_5asm.gg -d GRAPH/final_uclip/pangenome_5asm.dist -m GRAPH/final_uclip/pangenome_5asm.min -p -t 12 --hit-cap 100 --hard-hit-cap 5000 --score-fraction 1.0 --max-extensions 10000 --max-alignments 1000 --cluster-score 0 --cluster-coverage 0 --extension-score 0 --extension-set 0 -R SAMPLE -N SAMPLE -f WGS/SAMPLE_1.fastq -f WGS/SAMPLE_2.fastq > WGS/SAMPLE.gam

Guessing that GRAPH/final_uclip/pangenome_5asm.xg is XG
Guessing that GRAPH/final_uclip/pangenome_5asm.giraffe.gbz is Giraffe GBZ
Initializing MinimizerMapper
Loading and initialization: 11.0757 seconds
Mapping reads to "-" (GAM)
--hit-cap 100
--hard-hit-cap 5000
--score-fraction 1
--max-extensions 10000
--max-alignments 1000
--cluster-score 0
--pad-cluster-score 0
--cluster-coverage 0
--extension-score 0
--extension-set 0
--max-multimaps 1
--distance-limit 200
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--rescue-attempts 15
--rescue-algorithm dozeu
Not counting CPU instructions because perf events are unavailable: No such file or directory
Using fragment length estimate: 358.466 +/- 78.4354
warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 125 bp against 14566 bp ordered subgraph for read SAMPLE.164863 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.

My question is: does the warning mean that some of the alignments to the 14 kb region will be lost? If so, how can I prevent it?

Thank you in advance, Andrea

jltsiren commented 2 years ago

That's a common warning, and it might be best to just hide it from users.

The dynamic programming implementation we use can't align reads to arbitrarily large subgraphs. Even if it could, we would still abandon the attempt beyond some threshold, because it would require too much time for a single potential mapping of a single read.
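
To make the numbers in the warning concrete, here is a minimal sketch of the size check in Python. It assumes the cost of a rescue alignment is counted as one dynamic programming cell per (read base, subgraph base) pair, which is consistent with the figures in the warning but is an assumption here and not a statement about vg's actual code. The function names are hypothetical; the read length, subgraph length and cell limit are copied from the warning message.

```python
# Values copied from the warning message in this thread.
READ_LENGTH_BP = 125
SUBGRAPH_LENGTH_BP = 14566
MAX_CELLS = 1572864  # limit quoted in the warning; equals 1.5 * 1024 * 1024


def dp_cells(read_length_bp: int, subgraph_length_bp: int) -> int:
    """Assumed cost model: one cell per (read base, subgraph base) pair."""
    return read_length_bp * subgraph_length_bp


def rescue_is_too_large(read_length_bp: int, subgraph_length_bp: int,
                        max_cells: int = MAX_CELLS) -> bool:
    """True if the alignment would need more cells than the limit allows."""
    return dp_cells(read_length_bp, subgraph_length_bp) > max_cells


def largest_subgraph_bp(read_length_bp: int, max_cells: int = MAX_CELLS) -> int:
    """Largest subgraph (in bp) that stays within the limit for this read length."""
    return max_cells // read_length_bp


if __name__ == "__main__":
    print(dp_cells(READ_LENGTH_BP, SUBGRAPH_LENGTH_BP))             # 1820750
    print(rescue_is_too_large(READ_LENGTH_BP, SUBGRAPH_LENGTH_BP))  # True
    print(largest_subgraph_bp(READ_LENGTH_BP))                      # 12582
```

Under that assumption, a 125 bp read can be rescued against at most about 12.6 kb of subgraph sequence, so the 14566 bp region in the warning is just over the limit.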

Giraffe tries to use dynamic programming when a mapping looks promising enough but it can't extend any nearby seed to an alignment without gaps. If the relevant subgraph is too large and complex, Giraffe abandons the attempt. This happens more often in pair rescue, where the graph region is typically 500-1000 bp (but may contain tens of kilobases of sequence), than in the alignment phase, where the region is usually 200 bp or less.

If the underlying cause is an indel error in the read, other reads should align fine to that region. If the sequenced genome contains an indel in that region but the indel is not present in the graph, the issue could affect other reads containing the indel. It might be possible to avoid that by using graphs where complex regions are less collapsed and contain more duplicated sequence, but I don't think anyone has investigated that option.

RenzoTale88 commented 2 years ago

@jltsiren thank you for the explanation! I'll close this thread now then.

Andrea