Closed alexandrumeterez closed 2 years ago
I think I found an explanation here. But what could be the reason why I'm getting such weird consecutive deletions in sequences e.g.: -GGGCCATAGTA-GGGCCATAGTAACCTC-TCTA
I have a full example here: https://dpaste.org/nhXTF For context, the data is a randomly generated string from which I build the graph, then I generate reference sequences by random walks through the graph. On these references sequences I apply mutations and call them queries. Then I try to align the query to the graph.
The cigars from vg are based on walks in the graph. If the walk specifies that two consecutive nodes are skipped over, it could appear as two consecutive deletions in the output This is necessary in GAM format (which vg still uses internally), but I agree that the GAF could look cleaner if it were smoothed together.
That said, it would be very odd to see so many deletions if you are aligning one string to a linear reference graph (which should have nodes much llonger than your alignment implies.). If this is the case, please share the commands you used to create your graph and alignment and we can take a look.
Thanks for the reply. The genome sequence is in a .fa file and it's just a randomly generated sequence of ACTG.
To build the graph I am calling:
vg construct -r data/sequence.fa
vg index -x data/sequence.xg -g data/sequence.gcsa -k 16 data/sequence.vg
And then to align:
vg map -drop-full-l-bonus -f data/queries.fa -x data/sequence.xg -g data/sequence.gcsa > output_file.gam
And I take the cs:Z tag by calling:
vg view -a output_file.gam
Is there any other way to obtain a CIGAR string for the alignment? Or at least a way to get the clipping applied to the alignment start/end? This is mostly what I'm after.
I'm not sure if this is the issue, but the query sequence is long enough to trigger vg map
's chunked alignment routine, which can produce some confusing alignment results. You might get better alignments with -w 300
, or by using vg mpmap -n DNA -F GAF
When doing vg mpmap -n DNA -F GAF
I do get a more reasonable cs:Z tag.
But I'm still wondering if this contains soft clipping at the ends and if it's possible to print it, similar to how a regular CIGAR string has S.
trigger
vg map
's chunked alignment routine
aha, that explains it. I was just looking at the GAM, and there are multiple consecutive mappings on the same node. I'm reading the GAF code again and it seems entirely based on the assumption that this does not happen (it expects an explicit deletion). The result is a completely invalid GAF.
trigger
vg map
's chunked alignment routineaha, that explains it. I was just looking at the GAM, and there are multiple consecutive mappings on the same node. I'm reading the GAF code again and it seems entirely based on the assumption that this does not happen (it expects an explicit deletion). The result is a completely invalid GAF.
Thanks for the help and the very quick replies! I'd like to close this issue, I just need to confirm one last thing:
1) Do the alignments reported by vg mpmap
have soft clippings at the end? And if they do, how can I know how much there is on each end?
2) I'm trying to compare vg with a few other aligners. Is using mpmap instead of map a fair comparision? Is there anything different between the 2?
I can answer some of the questions regarding vg mpmap
.
Do the alignments reported by vg mpmap have soft clippings at the end? And if they do, how can I know how much there is on each end?
Yes, there are softclips. I'm not personally familiar with how the CS
-style CIGAR strings work though, but maybe @glennhickey can fill you in there.
I'm trying to compare vg with a few other aligners. Is using mpmap instead of map a fair comparision? Is there anything different between the 2?
It's a fair enough comparison. vg mpmap
is more built-out as an RNA-seq aligner, but it works for genomic sequence data as well. In general, it tends to be several times faster than vg map
but also somewhat less sensitive. vg mpmap
also tends to handle moderately long sequences more gracefully than vg map
(my sense it that it can handle up to several kbp of Q>20 sequence pretty well).
@glennhickey thanks for the quick fix!
Should I switch to using vg mpmap
or wait until the merge and use vg map
?
Also, could you please help on the CIGAR string and soft clippings question?
Any soft clips would just show up as insertions in GAF/GAM.
I believe GraphAligner remains the state of the art for mapping longer reads. vg giraffe
will be updated with proper long read support in a future release.
Maybe you can try both map
and mpmap
for your comparison?
@glennhickey Would it be possible to provide an x64 compiled version with this fix? I can't get vg to compile.
Hello! I'm using vg map to align some sequences to a graph and I would like to obtain a CIGAR string. From what I could find, I thought that the CS:Z tag encodes a CIGAR string, but I'm not sure how to interpret it since I can't really find an explanation. For example, in this CS:Z string, what does +, -, * and : encode?
+AGTCTAAGAAGTGTGGCCGCATAGTAACTCCAG:3*TA:1*TC:5-CC:7-GG*AT:3-AG:5-T:2-C:4*TA:10-T:3-TA:7-C:1-C:3*CT:1-CAA:5-A:8*AG:6-C:3-T:1-G:5-C:1-G:8-CG:4-C:2-CA-GG:2-AA:2-T:3-GG:2*AC:4-CA:23-TC:2-T:3-CG:4-G:1*TA:2-A:8-TA*CG:6-C:1-A:6-T:5-C:2-A:2-A:6-C:9-T*GA:12-GA:5+TA
And another question would be: are the alignments produced by vg map using soft clipping at the ends? Thanks!