vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

Negative end of path positions in vg giraffe mapping #4249

Open iamdeaardappel opened 3 months ago

iamdeaardappel commented 3 months ago

1. What were you trying to do?

I am working on a GAF parser, and currently trying to calculate alignment lengths. Alignment block length (column 11) typically mentions 151, even when the actual alignment is shorter. To obtain the length, I subtract the end position on the path (GFA column 9) from the start position (column 8), which gives me the correct length in 99.9% of the mappings. However, in some cases, the end position is lower than the start position. Sometimes, the end position is negative. I have two examples where I suspect the end position is incorrect.

2. What did you want to happen?

I expected end positions to be higher, at least not be negative.

3. What actually happened?

Negative end position example. Alignment starts at 0 and ends at -273. I can’t seem to find any information about the end being negative, also the path is only 79 bases long.

ST-E00288:114:H3NJ5ALXX:3:1102:13250:34817:1    151 0   151 +   >49055347>49055348>49055350>49055351>49055353>49055354>49055356>49055357>49055359>49055360>49055361>49055364>49055365>49055366>49055380>49055381>49055382>49055384>49055385>49055387>49055388>49055390>49055391>49055393>49055395>49055396>49055398>49055400>49055401>49056185>49056186>49056196>49056197>49056198>49056201>49056202>49056205>49056206>49056208>49056209>49056211   79  0   -273    79  151 60  AS:i:84 bq:Z:AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ    cs:Z::79+ATTTGCATTGCGAGATTTATTAGTAAGTGAAATTTTCAATAGAATTTTTCTTATATTTTAGGCTCAATTCTG   dv:f:0.4768 fn:Z:ST-E00288:114:H3NJ5ALXX:3:1102:13250:34817:2   pd:b:1

Example where coordinates don't seem to match. Unsure if this second example is a bug, but I’m unable to explain to obtained coordinates. Alignment starts at 1274 and ends 955., I thought maybe the read is mapped in reverse. But looking at the cigar, there is no large insertion that explains why the alignment length is 319 bases. A length close to 151 is to be expected.

ST-E00288:114:H3NJ5ALXX:3:1101:27011:5774:1     151     0       151     +       >105708692      1414    1274    955     139     151     60      AS:i:140        bq:Z:AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJ<JJJF<AAF--A7<--7F<<------7 cs:Z::128*TC:11+CCCCCCGCCGC     dv:f:0.0795     fn:Z:ST-E00288:114:H3NJ5ALXX:3:1101:27011:5774:2        pd:b:1

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

-

5. What data and command can the vg dev team use to make the problem happen?

source venv-cactus-v2.6.8/bin/activate
cactus-pangenome ./js genomes.txt --outDir potato-pg --outName potato-pg --reference DM8 --vcf --giraffe --gfa --gbz --maxCores 48
./vg giraffe --named-coordinates -o gaf -t 12 -Z cactus/potato-pg/potato-pg.d2.gbz -m cactus/potato-pg/potato-pg.d2.min -d cactus/potato-pg/potato-pg.d2.dist -f XXX_L003_R1_001.fastq.gz  -f XXX_L003_R2_001.fastq.gz > run.gaf

6. What does running vg version say?

vg version v1.51.0 "Quellenhof"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu

Thanks for reading!

ASLeonard commented 3 months ago

I saw some similar issues that were fixed in #3996, but came across further errors in #4008 which remain unresolved. In general the --named-coordinates flag seems to be troublesome, but it is definitely nicer to use for a lot of applications.