vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

forwardize_breakpoints error: failure,position 135211-3 is not inside node 135211 #3823

Open JD12138 opened 1 year ago

JD12138 commented 1 year ago

Hello, I used the HPRC pangenome (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph/hprc-v1.0-minigraph-grch38.gfa.gz) to do the analysis. My processes to do the analysis are listed below.

  1. build an index (vg autoindex -g hprc-v1.0-minigraph-grch38.gfa -T ./tmp )
  2. map my assembled contigs to the graph. (vg map -F my.contigs.fa -x hprc-v1.0-minigraph-grch38.xg -g hprc-v1.0-minigraph-grch38.csa >aln.gam)
  3. convert .gfa to .vg (vg convert -g hprc-v1.0-minigraph-grch38.gfa -v >hprc-v1.0-minigraph-grch38.vg)
  4. augment the .vg graph with my .gam file. (vg augment hprc-v1.0-minigraph-grch38.vg -s aln.gam >aug.vg)

Then I got the error message: *forwardize_breakpoints error: failure, position 135211-3 is not inside node 135211 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_NTCL3r/stacktrace.txt Please include the stack trace file in your bug report!**

Then I validate the gam file (vg validate hprc-v1.0-minigraph-grch38.vg -a aln.gam). The output file report so many invalid alignment. But I use the same file to do the mapping. Could you tell me the reason? Thank you so much!

jeizenga commented 1 year ago

I think I know what's going on, and it's a common point of confusion that we should really do something about. When sequences are significantly longer than short reads, vg map breaks the sequences into chunks and maps them independently, which can lead to discontiguous alignments. Many of the tools in the vg toolkit require contiguous alignments. For moderately long sequences, vg mpmap -n dna works pretty well, and it only produces contiguous alignments. You can also use GraphAligner for very long sequences.

JD12138 commented 1 year ago

I have tried before to align with GraphAligner to get the gam file. Then using vg augment to embed the gam into my graph. But the same error was reported. Then I use vg validate. There are 89 invalid alignments. But I use the same vg file in GraphAligner and vg augment. I reported this issue at https://github.com/vgteam/vg/issues/3717 . Someone else has the same question as me.

Zoeyoungxy commented 1 year ago

I met the same questions when using vg augment , and I got the error message: forwardize_breakpoints error: failure, position 72522423+152 is not inside node 72522423 forwardize_breakpoints error: failure, position 66398702+10 is not inside node 66398702 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_m9CpNU/stacktrace.txt Please include the stack trace file in your bug report!

The gam file was generated using GraphAligner as well. Have you solved the problem? Any suggestions will be appreciated. Thank You.

smriti-135 commented 3 months ago

Hello. I am facing the same error with slightly different steps (vg giraffe instead of map). Should i post my errors and steps here or open a separate issue?

edit: i have done vg convert -p of both gfa and gbz files into pg. while using vg validate, only the pg converted from gbz shows valid. i'm confused if i need to rerun the pipeline or continue with gbz

jltsiren commented 3 months ago

@smriti-135 Many vg tools require that nodes are no longer than 1024 bp, for various practical reasons. Most published pangenome graphs have longer nodes, at least in the GFA file. If you use vg autoindex with a GFA graph or convert the GFA to GBZ with vg gbwt, the long nodes will be chopped to a sequence of shorter nodes (32 bp for vg autoindex, 1024 bp for GBZ). But if you convert the GFA to a non-GBZ graph, the long nodes will remain.

This means that you can get three different graphs from the same GFA graph, depending on the commands you use. These graphs are not directly compatible with each other, and they cannot be mixed in the same pipeline. You should use the graph you get from vg autoindex (or vg gbwt) as the basis for further work instead of using the original GFA.

smriti-135 commented 2 months ago

@jltsiren i had carried forward with vg autoindex as follows:

vg autoindex --workflow giraffe --prefix pangenome_graph --ref-fasta teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa --vcf maf135.vcf --target-mem 12GB --threads 8
vg giraffe -Z pangenome_graph.giraffe.gbz -m pangenome_graph.min -d pangenome_graph.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz > ANP1.gam -p

vg convert -p pangenome_graph.giraffe.gbz -t 8 > pangenome_graph_gbz_vc.pg
vg augment pangenome_graph_gbz_vc.pg ANP1.gam -A ANP1_vc.aug.gam -m 2 -q 10 -v -t 16 > ANP1_graph_gbz_vc.aug.pg -p
vg snarls ANP1_graph_gbz_vc.aug.pg > ANP1_snarls_gbz_vc.snarls
vg pack -x ANP1_graph_gbz_vc.aug.pg -g ANP1_vc.aug.gam -Q 5 -t 16 -o ANP1_genotype_gbz_vc.aug.pack

vg call ANP1_graph_gbz_vc.aug.pg -r ANP1_snarls_gbz_vc.snarls -k ANP1_genotype_gbz_vc.aug.pack -t 8 > ANP1_variants_gbz_vc_4.aug.vcf

The resulting vcf (723 MB) does not seem to contain SVs. I'm adding a snippet of the file

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
Pseudomolecule_01   6203    >5777134>23818067   A   G   49.499  lowdepth    AT=>5777134>23818066>23818067,>5777134>30505148>23818067;DP=2   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.447250,-1.451595,-1.150565:3:-1.504111:0.795580:2
Pseudomolecule_01   6246    >5777146>5777151    T   C   49.499  lowdepth    AT=>5777146>5777148>5777150>5777151,>5777146>5777148>5777149>5777151;DP=2   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.170265,-1.174610,-0.873579:3:-2.313027:1.924528:2
Pseudomolecule_01   6268    >5777154>25635207   C   T   49.499  lowdepth    AT=>5777154>25635206>25635207,>5777154>30505147>25635207;DP=2   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01   6284    >5777160>25130643   CAT TCG 49.499  lowdepth    AT=>5777160>25130642>25130643,>5777160>30505146>25130643;DP=2   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01   6914    >5777219>26368344   T   C   49.499  lowdepth    AT=>5777219>26368343>26368344,>5777219>31277985>26368344;DP=2   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.169738,-1.174083,-0.873052:3:-1.504111:1.968750:2
Pseudomolecule_01   8212    >5777294>5777297    A   G   26.4111 lowad   AT=>5777294>5777296>5777297,>5777294>5777295>5777297;DP=1   GT:DP:AD:GL:GQ:GP:XD:MAD    1/0:1:0,1:-2.736928,-0.739100,-1.080097:3:-1.481213:0.916667:0
Pseudomolecule_01   9110    >5777355>26518925   A   T   9.54243 lowad   AT=>5777355>26518924>26518925,>5777355>30556276>26518925;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.353163,-0.353163,-0.353163:0:-2.197225:0.813187:0
Pseudomolecule_01   9120    >5777356>24850492   T   C   9.54243 lowad   AT=>5777356>24850491>24850492,>5777356>30556277>24850492;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.477189,-0.477189,-0.477189:0:-2.197225:1.098765:0
Pseudomolecule_01   9134    >24850492>5777359   G   T   49.499  lowdepth    AT=>24850492>5777358>5777359,>24850492>29461530>5777359;DP=2    GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:2:0,2:-5.298480,-1.302825,-1.001794:3:-1.504111:1.098765:2
Pseudomolecule_01   12073   >5777458>5777459    A   AG  12.5346 PASS    AT=>5777458>5777459,>5777458>30452372>5777459;DP=7  GT:DP:AD:GL:GQ:GP:XD:MAD    0/1:7:6,1:-2.989001,-2.292207,-13.899537:6:-1.281771:4.817708:1
Pseudomolecule_01   12621   >5777480>23306363   G   A   33.6934 PASS    AT=>5777480>23306362>23306363,>5777480>30235750>23306363;DP=7   GT:DP:AD:GL:GQ:GP:XD:MAD    0/1:7:5,2:-4.531480,-1.639815,-11.327922:28:-1.099895:7.661539:2
Pseudomolecule_01   12687   >5777482>26018630   A   T   7.17049 PASS    AT=>5777482>26018629>26018630,>5777482>28697073>26018630;DP=10  GT:DP:AD:GL:GQ:GP:XD:MAD    0/1:10:9,1:-2.814530,-2.946761,-20.198554:1:-1.616188:10.761194:1
Pseudomolecule_01   12690   >26018630>26018632  A   G   28.0416 PASS    AT=>26018630>26018631>26018632,>26018630>30235751>26018632;DP=9 GT:DP:AD:GL:GQ:GP:XD:MAD    0/1:9:7,2:-4.444100,-2.119110,-15.802872:23:-1.103333:10.761194:2
Pseudomolecule_01   12781   >5777490>27070101   TAAAAAAAA   AAAAAAAAA,T 209.761 PASS    AT=>5777490>27070099>27070100>27070101,>5777490>29068302>27070100>27070101,>5777490>27070099>27070101;DP=13 GT:DP:AD:GL:GQ:GP:XD:MAD    1/2:13:2,7,4:-25.097774,-8.895493,-14.888975,-10.349251,-4.899838,-15.075861:39:-1.791864:8.634561:4

Am I missing something? I also tried with the '-f' and '-c' options but only the file size decreased due to the filters.

jltsiren commented 2 months ago

@smriti-135 Did you have SVs in the original VCF? And did it contain phased haplotypes?

smriti-135 commented 2 months ago

@jltsiren saw that was the problem. It had only SNPs. But when I try to merge it with the one with SVs using bcftools merge ANP1.vcf.gz ANP1.delly.vcf.gz -o ANP1_merged.vcf.gz the resultant merged vcf does not recognise it as the same sample and instead adds it as a new column #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SO_11111_ANP1_SM 2:SO_11111_ANP1_SM any solutions for that?

Edit. I used a tool called delly to call the SVs and form a separate file ANP1.delly.vcf. When I repeat the above workflow with that ANP1.delly.vcf instead of maf135.vcf I am still getting an "empty" file with no SVs in spite of knowing it contains it!

jltsiren commented 1 month ago

I haven't had time to answer this, as there are so many things that can go wrong. My best guess is some kind of naming conflict between the VCF file(s) and the reference.

When you run the commands, do you get any warnings?

Does the SV-only graph actually contain any variants? What is the output of vg stats -z graph.gbz?

I'm not very familiar with vg call, and it's possible that there are some issues in the commands you run.

smriti-135 commented 1 month ago

When you run the commands, do you get any warnings?

No there were no warnings except a lot of this

warning:[vg::Constructor] Lowercase characters found in Scaffold_Un904; coercing to uppercase.

But when I tried again now, I got this now

error:[vg::Constructor] Variant/reference sequence mismatch: A vs pos: 20643581: T; do your VCF and FASTA coordinates match? Variant: Pseudomolecule_01 20643581 Pseudomolecule_01_20643581 A G 0 . PR zero ind: 20643580 1-indexed: 20643581

and there is no gbz output.

jltsiren commented 1 month ago

If you are getting different results in the graph construction part, it's likely that you either have corrupted files or other hardware errors or you were running different commands.

smriti-135 commented 1 month ago

was able to run vg stats pangenome_graph_ai.giraffe.gbz but its not giving any output