vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

GAF Format and CS:Z Tag #3725

Closed alexandrumeterez closed 2 years ago

alexandrumeterez commented 2 years ago

Hello! I'm using vg map to align some sequences to a graph and I would like to obtain a CIGAR string. From what I could find, I thought that the CS:Z tag encodes a CIGAR string, but I'm not sure how to interpret it since I can't really find an explanation. For example, in this CS:Z string, what does +, -, * and : encode?

+AGTCTAAGAAGTGTGGCCGCATAGTAACTCCAG:3*TA:1*TC:5-CC:7-GG*AT:3-AG:5-T:2-C:4*TA:10-T:3-TA:7-C:1-C:3*CT:1-CAA:5-A:8*AG:6-C:3-T:1-G:5-C:1-G:8-CG:4-C:2-CA-GG:2-AA:2-T:3-GG:2*AC:4-CA:23-TC:2-T:3-CG:4-G:1*TA:2-A:8-TA*CG:6-C:1-A:6-T:5-C:2-A:2-A:6-C:9-T*GA:12-GA:5+TA

And another question would be: are the alignments produced by vg map using soft clipping at the ends? Thanks!

alexandrumeterez commented 2 years ago

I think I found an explanation here. But what could be the reason why I'm getting such weird consecutive deletions in sequences e.g.: -GGGCCATAGTA-GGGCCATAGTAACCTC-TCTA

I have a full example here: https://dpaste.org/nhXTF For context, the data is a randomly generated string from which I build the graph, then I generate reference sequences by random walks through the graph. On these references sequences I apply mutations and call them queries. Then I try to align the query to the graph.

glennhickey commented 2 years ago

The cigars from vg are based on walks in the graph. If the walk specifies that two consecutive nodes are skipped over, it could appear as two consecutive deletions in the output This is necessary in GAM format (which vg still uses internally), but I agree that the GAF could look cleaner if it were smoothed together.

That said, it would be very odd to see so many deletions if you are aligning one string to a linear reference graph (which should have nodes much llonger than your alignment implies.). If this is the case, please share the commands you used to create your graph and alignment and we can take a look.

alexandrumeterez commented 2 years ago

Thanks for the reply. The genome sequence is in a .fa file and it's just a randomly generated sequence of ACTG.

To build the graph I am calling: vg construct -r data/sequence.fa vg index -x data/sequence.xg -g data/sequence.gcsa -k 16 data/sequence.vg

And then to align: vg map -drop-full-l-bonus -f data/queries.fa -x data/sequence.xg -g data/sequence.gcsa > output_file.gam

And I take the cs:Z tag by calling: vg view -a output_file.gam

Is there any other way to obtain a CIGAR string for the alignment? Or at least a way to get the clipping applied to the alignment start/end? This is mostly what I'm after.

jeizenga commented 2 years ago

I'm not sure if this is the issue, but the query sequence is long enough to trigger vg map's chunked alignment routine, which can produce some confusing alignment results. You might get better alignments with -w 300, or by using vg mpmap -n DNA -F GAF

alexandrumeterez commented 2 years ago

When doing vg mpmap -n DNA -F GAF I do get a more reasonable cs:Z tag.

But I'm still wondering if this contains soft clipping at the ends and if it's possible to print it, similar to how a regular CIGAR string has S.

glennhickey commented 2 years ago

trigger vg map's chunked alignment routine

aha, that explains it. I was just looking at the GAM, and there are multiple consecutive mappings on the same node. I'm reading the GAF code again and it seems entirely based on the assumption that this does not happen (it expects an explicit deletion). The result is a completely invalid GAF.

alexandrumeterez commented 2 years ago

trigger vg map's chunked alignment routine

aha, that explains it. I was just looking at the GAM, and there are multiple consecutive mappings on the same node. I'm reading the GAF code again and it seems entirely based on the assumption that this does not happen (it expects an explicit deletion). The result is a completely invalid GAF.

Thanks for the help and the very quick replies! I'd like to close this issue, I just need to confirm one last thing:

1) Do the alignments reported by vg mpmap have soft clippings at the end? And if they do, how can I know how much there is on each end? 2) I'm trying to compare vg with a few other aligners. Is using mpmap instead of map a fair comparision? Is there anything different between the 2?

jeizenga commented 2 years ago

I can answer some of the questions regarding vg mpmap.

Do the alignments reported by vg mpmap have soft clippings at the end? And if they do, how can I know how much there is on each end?

Yes, there are softclips. I'm not personally familiar with how the CS-style CIGAR strings work though, but maybe @glennhickey can fill you in there.

I'm trying to compare vg with a few other aligners. Is using mpmap instead of map a fair comparision? Is there anything different between the 2?

It's a fair enough comparison. vg mpmap is more built-out as an RNA-seq aligner, but it works for genomic sequence data as well. In general, it tends to be several times faster than vg map but also somewhat less sensitive. vg mpmap also tends to handle moderately long sequences more gracefully than vg map (my sense it that it can handle up to several kbp of Q>20 sequence pretty well).

alexandrumeterez commented 2 years ago

@glennhickey thanks for the quick fix! Should I switch to using vg mpmap or wait until the merge and use vg map? Also, could you please help on the CIGAR string and soft clippings question?

glennhickey commented 2 years ago

Any soft clips would just show up as insertions in GAF/GAM.

I believe GraphAligner remains the state of the art for mapping longer reads. vg giraffe will be updated with proper long read support in a future release.

Maybe you can try both map and mpmap for your comparison?

alexandrumeterez commented 2 years ago

@glennhickey Would it be possible to provide an x64 compiled version with this fix? I can't get vg to compile.

glennhickey commented 2 years ago

http://public.gi.ucsc.edu/~hickey/vg.8a763bb9b12d549c3daf195499d662d73e778c77