vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

vg giraffe invalid alignment: Edit erroneously claims mismatch on node #4410

Closed ryandkuster closed 2 weeks ago

ryandkuster commented 2 weeks ago

1. What were you trying to do?

Align illumina short reads with vg giraffe to produce a gam for pack/call steps.

2. What did you want to happen?

I wanted a valid .gam file.

3. What actually happened?

The resulting .gam file is invalid.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Running vg validata -a on the .gam and graph used for alignment yields the following type of message for many samples:

Invalid Alignment:
{"annotation": {"fragment_length": 194, "fragment_length_distribution": "-I 344.491579 -D 79.505853", "last_placed_stage": "none", "last_placed_stage_0bp": "none", "mapq_applied_cap": 0.7918124604762482, "mapq_explored_cap": 2.3259851532217066, "mapq_score_group": 2147483647, "mapq_uncapped": 54, "proper_pair": true, "rescuer": true, "secondary_scores": [87.704993663851042, 78.142053984938102, 77.790647025226036, 77.169841752879336, 76.810891790233569, 8]}, "fragment_prev": {"name": "LH00328:327:2277LMLT4:8:2340:41494:22572"}, "identity": 0.56060606060606055, "name": "LH00328:327:2277LMLT4:8:2340:41494:22572", "path": {"mapping": [{"edit": [{"sequence": "TTGGTGGGGGGCGCG", "to_length": 15}, {"from_length": 28, "to_length": 28}], "position": {"node_id": "51551762", "offset": "1"}, "rank": "1"}, {"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": "51551763"}, "rank": "1"}, {"edit": [{"from_length": 7, "to_length": 7}], "position": {"node_id": "51551765"}, "rank": "2"}, {"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": "51551767"}, "rank": "3"}, {"edit": [{"from_length": 1, "sequence": "N", "to_length": 1}, {"from_length": 1, "sequence": "T", "to_length": 1}, {"from_length": 1, "sequence": "T", "to_length": 1}, {"from_length": 1, "sequence": "T", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "A", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "G", "to_length": 1}, {"from_length": 1, "sequence": "C", "to_length": 1}, {"from_length": 1, "sequence": "C", "to_length": 1}], "position": {"node_id": "51551769"}, "rank": "4"}]}, "quality": "KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKAIoKCgoKCgoKCgoKCgo", "score": 42, "sequence": "TTGGTGGGGGGCGCGGGCGAAGTTCGTCCGCCGGACTCGGAATGGGGCGAGANTTTGCGAGGGGCC", "time_used": 0.000746446}
Edit erroneously claims mismatch on node 51551769 between node position 0 and edit 0, position 0 on forward strand 

5. What data and command can the vg dev team use to make the problem happen?

The pangenome was built per-chromosome with PGGB. At the various stages of converting to .og and .gfa both odgi validate and vg validate give no errors. The vg giraffe step has also been aligned using the XG graph and a .gbz produced without the --vg-algorithm conversion.

odgi squeeze --optimize -f chr_paths.txt -o squeeze.og
odgi view -i squeeze.og -g > squeeze.gfa
vg convert -g squeeze.gfa -f -t 20 --vg-algorithm > vg_alg.gfa
vg autoindex -R XG -t 20 --workflow giraffe --prefix vg_alg --gfa vg_alg.gfa
vg giraffe -Z vg_alg.gbz -m vg_alg.min -d vg_alg.dist -t 24 -f R1.fq -f R2.fq > giraffe.gam

6. What does running vg version say?

vg version v1.59.0 "Casatico"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu

If there are any troubleshooting steps I should consider or steps to modify I would appreciate any advice. I thank you all sincerely for your time and effort with vg!

jeizenga commented 2 weeks ago

It looks to me like this is a problem with vg validate, not vg giraffe. The read and reference bases are both N, and we generally don't consider two N's to be "matching", since N is only used as a placeholder for an unknown base. I think you should be able to use the GAM file without issues.

ryandkuster commented 2 weeks ago

@jeizenga Thanks for the quick response!

That's reassuring that the alignments may in fact be valid. The .gam files were my first step in troubleshooting a vg pack error that has been cropping up in only a small percentage of my samples.

I've been running into an error that usually takes the following form:

vg: src/packer.cpp:883: size_t vg::Packer::average_node_quality(size_t) const: Assertion `total_node_quality(i) == 0' failed.
━━━━━━━━━━━━━━━━━━━━                                                            
Crash report for vg v1.59.0 "Casatico"                                          
Stack trace (most recent call last) in thread 2437524:                          
#10   Object "", at 0xffffffffffffffff, in                                      
#9    Object "vg", at 0x21e9173, in __clone             
#8    Object "vg", at 0x214269a, in start_thread        
#7    Object "vg", at 0x20e4f2d, in gomp_thread_start   
#6    Object "vg", at 0x1291d70, in vg::Packer::make_compact() [clone ._omp_fn.6]
#5    Object "vg", at 0x129167f, in vg::Packer::average_node_quality(unsigned long) const
#4    Object "vg", at 0x2110e95, in __assert_fail       
#3    Object "vg", at 0x5ee653, in __assert_fail_base.cold
#2    Object "vg", at 0x5ee72b, in abort                
#1    Object "vg", at 0x21174a5, in raise               
#0    Object "vg", at 0x2143ffc, in __pthread_kill      
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!                        
━━━━━━━━━━━━━━━━━━━━        

I'm following a few of the troubleshooting steps suggested in #4355 , which may not be identical to my own but I didn't want to open a new issue if the alignment was the cause.