Open JD12138 opened 1 year ago
I think I know what's going on, and it's a common point of confusion that we should really do something about. When sequences are significantly longer than short reads, vg map
breaks the sequences into chunks and maps them independently, which can lead to discontiguous alignments. Many of the tools in the vg
toolkit require contiguous alignments. For moderately long sequences, vg mpmap -n dna
works pretty well, and it only produces contiguous alignments. You can also use GraphAligner for very long sequences.
I have tried before to align with GraphAligner to get the gam file. Then using vg augment to embed the gam into my graph. But the same error was reported. Then I use vg validate. There are 89 invalid alignments. But I use the same vg file in GraphAligner and vg augment. I reported this issue at https://github.com/vgteam/vg/issues/3717 . Someone else has the same question as me.
I met the same questions when using vg augment , and I got the error message: forwardize_breakpoints error: failure, position 72522423+152 is not inside node 72522423 forwardize_breakpoints error: failure, position 66398702+10 is not inside node 66398702 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_m9CpNU/stacktrace.txt Please include the stack trace file in your bug report!
The gam file was generated using GraphAligner as well. Have you solved the problem? Any suggestions will be appreciated. Thank You.
Hello. I am facing the same error with slightly different steps (vg giraffe instead of map). Should i post my errors and steps here or open a separate issue?
edit: i have done vg convert -p of both gfa and gbz files into pg. while using vg validate, only the pg converted from gbz shows valid. i'm confused if i need to rerun the pipeline or continue with gbz
@smriti-135 Many vg tools require that nodes are no longer than 1024 bp, for various practical reasons. Most published pangenome graphs have longer nodes, at least in the GFA file. If you use vg autoindex
with a GFA graph or convert the GFA to GBZ with vg gbwt
, the long nodes will be chopped to a sequence of shorter nodes (32 bp for vg autoindex
, 1024 bp for GBZ). But if you convert the GFA to a non-GBZ graph, the long nodes will remain.
This means that you can get three different graphs from the same GFA graph, depending on the commands you use. These graphs are not directly compatible with each other, and they cannot be mixed in the same pipeline. You should use the graph you get from vg autoindex
(or vg gbwt
) as the basis for further work instead of using the original GFA.
@jltsiren i had carried forward with vg autoindex
as follows:
vg autoindex --workflow giraffe --prefix pangenome_graph --ref-fasta teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa --vcf maf135.vcf --target-mem 12GB --threads 8
vg giraffe -Z pangenome_graph.giraffe.gbz -m pangenome_graph.min -d pangenome_graph.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz > ANP1.gam -p
vg convert -p pangenome_graph.giraffe.gbz -t 8 > pangenome_graph_gbz_vc.pg
vg augment pangenome_graph_gbz_vc.pg ANP1.gam -A ANP1_vc.aug.gam -m 2 -q 10 -v -t 16 > ANP1_graph_gbz_vc.aug.pg -p
vg snarls ANP1_graph_gbz_vc.aug.pg > ANP1_snarls_gbz_vc.snarls
vg pack -x ANP1_graph_gbz_vc.aug.pg -g ANP1_vc.aug.gam -Q 5 -t 16 -o ANP1_genotype_gbz_vc.aug.pack
vg call ANP1_graph_gbz_vc.aug.pg -r ANP1_snarls_gbz_vc.snarls -k ANP1_genotype_gbz_vc.aug.pack -t 8 > ANP1_variants_gbz_vc_4.aug.vcf
The resulting vcf (723 MB) does not seem to contain SVs. I'm adding a snippet of the file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Pseudomolecule_01 6203 >5777134>23818067 A G 49.499 lowdepth AT=>5777134>23818066>23818067,>5777134>30505148>23818067;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.447250,-1.451595,-1.150565:3:-1.504111:0.795580:2
Pseudomolecule_01 6246 >5777146>5777151 T C 49.499 lowdepth AT=>5777146>5777148>5777150>5777151,>5777146>5777148>5777149>5777151;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.170265,-1.174610,-0.873579:3:-2.313027:1.924528:2
Pseudomolecule_01 6268 >5777154>25635207 C T 49.499 lowdepth AT=>5777154>25635206>25635207,>5777154>30505147>25635207;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01 6284 >5777160>25130643 CAT TCG 49.499 lowdepth AT=>5777160>25130642>25130643,>5777160>30505146>25130643;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.185855,-1.190200,-0.889169:3:-1.504111:1.637931:2
Pseudomolecule_01 6914 >5777219>26368344 T C 49.499 lowdepth AT=>5777219>26368343>26368344,>5777219>31277985>26368344;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.169738,-1.174083,-0.873052:3:-1.504111:1.968750:2
Pseudomolecule_01 8212 >5777294>5777297 A G 26.4111 lowad AT=>5777294>5777296>5777297,>5777294>5777295>5777297;DP=1 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:1:0,1:-2.736928,-0.739100,-1.080097:3:-1.481213:0.916667:0
Pseudomolecule_01 9110 >5777355>26518925 A T 9.54243 lowad AT=>5777355>26518924>26518925,>5777355>30556276>26518925;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.353163,-0.353163,-0.353163:0:-2.197225:0.813187:0
Pseudomolecule_01 9120 >5777356>24850492 T C 9.54243 lowad AT=>5777356>24850491>24850492,>5777356>30556277>24850492;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.477189,-0.477189,-0.477189:0:-2.197225:1.098765:0
Pseudomolecule_01 9134 >24850492>5777359 G T 49.499 lowdepth AT=>24850492>5777358>5777359,>24850492>29461530>5777359;DP=2 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:2:0,2:-5.298480,-1.302825,-1.001794:3:-1.504111:1.098765:2
Pseudomolecule_01 12073 >5777458>5777459 A AG 12.5346 PASS AT=>5777458>5777459,>5777458>30452372>5777459;DP=7 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:7:6,1:-2.989001,-2.292207,-13.899537:6:-1.281771:4.817708:1
Pseudomolecule_01 12621 >5777480>23306363 G A 33.6934 PASS AT=>5777480>23306362>23306363,>5777480>30235750>23306363;DP=7 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:7:5,2:-4.531480,-1.639815,-11.327922:28:-1.099895:7.661539:2
Pseudomolecule_01 12687 >5777482>26018630 A T 7.17049 PASS AT=>5777482>26018629>26018630,>5777482>28697073>26018630;DP=10 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:10:9,1:-2.814530,-2.946761,-20.198554:1:-1.616188:10.761194:1
Pseudomolecule_01 12690 >26018630>26018632 A G 28.0416 PASS AT=>26018630>26018631>26018632,>26018630>30235751>26018632;DP=9 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:9:7,2:-4.444100,-2.119110,-15.802872:23:-1.103333:10.761194:2
Pseudomolecule_01 12781 >5777490>27070101 TAAAAAAAA AAAAAAAAA,T 209.761 PASS AT=>5777490>27070099>27070100>27070101,>5777490>29068302>27070100>27070101,>5777490>27070099>27070101;DP=13 GT:DP:AD:GL:GQ:GP:XD:MAD 1/2:13:2,7,4:-25.097774,-8.895493,-14.888975,-10.349251,-4.899838,-15.075861:39:-1.791864:8.634561:4
Am I missing something? I also tried with the '-f' and '-c' options but only the file size decreased due to the filters.
@smriti-135 Did you have SVs in the original VCF? And did it contain phased haplotypes?
@jltsiren saw that was the problem. It had only SNPs. But when I try to merge it with the one with SVs using
bcftools merge ANP1.vcf.gz ANP1.delly.vcf.gz -o ANP1_merged.vcf.gz
the resultant merged vcf does not recognise it as the same sample and instead adds it as a new column
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SO_11111_ANP1_SM 2:SO_11111_ANP1_SM
any solutions for that?
Edit. I used a tool called delly to call the SVs and form a separate file ANP1.delly.vcf. When I repeat the above workflow with that ANP1.delly.vcf instead of maf135.vcf I am still getting an "empty" file with no SVs in spite of knowing it contains it!
I haven't had time to answer this, as there are so many things that can go wrong. My best guess is some kind of naming conflict between the VCF file(s) and the reference.
When you run the commands, do you get any warnings?
Does the SV-only graph actually contain any variants? What is the output of vg stats -z graph.gbz
?
I'm not very familiar with vg call
, and it's possible that there are some issues in the commands you run.
When you run the commands, do you get any warnings?
No there were no warnings except a lot of this
warning:[vg::Constructor] Lowercase characters found in Scaffold_Un904; coercing to uppercase.
But when I tried again now, I got this now
error:[vg::Constructor] Variant/reference sequence mismatch: A vs pos: 20643581: T; do your VCF and FASTA coordinates match? Variant: Pseudomolecule_01 20643581 Pseudomolecule_01_20643581 A G 0 . PR zero ind: 20643580 1-indexed: 20643581
and there is no gbz output.
If you are getting different results in the graph construction part, it's likely that you either have corrupted files or other hardware errors or you were running different commands.
was able to run vg stats pangenome_graph_ai.giraffe.gbz
but its not giving any output
Hello, I used the HPRC pangenome (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph/hprc-v1.0-minigraph-grch38.gfa.gz) to do the analysis. My processes to do the analysis are listed below.
Then I got the error message: *forwardize_breakpoints error: failure, position 135211-3 is not inside node 135211 vg: src/augment.cpp:514: std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > > vg::forwardize_breakpoints(const HandleGraph, const std::unordered_map<long long int, std::set<std::tuple<long long int, bool, long unsigned int> > >&): Assertion `false' failed. ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Stack trace path: /tmp/vg_crash_NTCL3r/stacktrace.txt Please include the stack trace file in your bug report!**
Then I validate the gam file (vg validate hprc-v1.0-minigraph-grch38.vg -a aln.gam). The output file report so many invalid alignment. But I use the same file to do the mapping. Could you tell me the reason? Thank you so much!