vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Conversion to SAM without surject realignment #4070

Open Chris7 opened 1 year ago

Chris7 commented 1 year ago

I'd like to be able to convert GAM output to BAM without realignment against the surjected path. I have a little script that converts regions of a GAM file to a coverage-based view (attached; the red bars are SNPs). [Screenshot from 2023-09-03 09-52-06]

For further debugging, I'd like to be able to zoom into the raw read support for a region via BAMs since people are much more familiar with those. Looking at the surject code, a realignment takes place so there will be discrepancies between the coverage view depth and the BAM depth. Is there an existing way to simply convert the GAM alignment into the SAM format? If not, it's something we'd be willing to write if we can get a bit of guidance on some of the nuances of the json-gam format (this is the format we use for generating the coverage graph).
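As a rough illustration of the kind of coverage view described above (not the actual script from this thread), the sketch below sums covered graph bases per node from the JSON form of a GAM. It assumes one JSON alignment per line, as produced by something like `vg view -a`, with field names taken from the vg.proto schema; the function name `node_coverage` and the example reads are hypothetical.

```python
import json
from collections import Counter

def node_coverage(json_lines):
    """Sum, per node, the graph bases covered by aligned reads.

    Assumes each line is one JSON-encoded Alignment whose
    path.mapping[].edit[].from_length fields follow vg.proto.
    """
    coverage = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        alignment = json.loads(line)
        for mapping in alignment.get("path", {}).get("mapping", []):
            # 64-bit integers are usually serialized as strings in protobuf JSON
            node_id = int(mapping["position"]["node_id"])
            for edit in mapping.get("edit", []):
                # Zero-valued fields are omitted, so default to 0
                coverage[node_id] += int(edit.get("from_length", 0))
    return coverage

# Hypothetical two-read example
example = [
    '{"name": "read1", "path": {"mapping": ['
    '{"position": {"node_id": "1"}, "edit": [{"from_length": 5, "to_length": 5}]},'
    '{"position": {"node_id": "3"}, "edit": [{"from_length": 4, "to_length": 4}]}]}}',
    '{"name": "read2", "path": {"mapping": ['
    '{"position": {"node_id": "1"}, "edit": [{"from_length": 3, "to_length": 3}]}]}}',
]
cov = node_coverage(example)
print(dict(cov))
```

This counts covered bases per node, not per-base depth; a per-base view would also need each mapping's offset and the node lengths from the graph.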

jeizenga commented 1 year ago

The SAM/BAM formats require the alignment to be specified using a CIGAR string, which includes a full base-level alignment of the read to the reference. Because of this, I don't see an obvious way to circumvent the necessity of realigning the off-reference portions of the read in vg surject. You could perhaps create separate supplementary alignments for each reference-overlapping segment, but I don't think this is what most users want out of vg surject, and I don't think we have any plans to implement that pipeline.
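To make the CIGAR requirement concrete, here is a minimal sketch that parses a CIGAR string and reports how many reference and read bases it accounts for. The operation classes follow the SAM specification (M, D, N, = and X consume reference; M, I, S, = and X consume the read); the helper names are illustrative only.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs, rejecting malformed input."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def reference_span(cigar):
    """Number of reference bases the alignment covers."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)

def query_length(cigar):
    """Number of read bases the CIGAR accounts for."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)

cigar = "20M5I10M3D15M"
print(reference_span(cigar), query_length(cigar))
```

Every read base has to land in one of these operations, which is why read sequence that lies off the reference path still needs some base-level representation in the output.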

If you're interested in digging into the GAM format (or its JSON representation), you should look at the schema described here: https://github.com/vgteam/libvgio/blob/master/deps/vg.proto#L38-L151
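As a hedged sketch of how the edits in that schema might be read from the JSON form: each Mapping carries a list of Edit records with `from_length` (graph bases), `to_length` (read bases), and an optional `sequence`. My reading of the schema comments is that equal lengths with no sequence mean a match, equal lengths with a sequence mean a mismatch, and unequal lengths mean an insertion or deletion. The function names below are hypothetical, and the code assumes pure insertions and deletions (one of the two lengths is zero).

```python
def classify_edit(edit):
    """Map one JSON Edit record to a CIGAR-style operation letter."""
    from_len = int(edit.get("from_length", 0))  # omitted in JSON when zero
    to_len = int(edit.get("to_length", 0))
    if from_len == to_len:
        return "X" if edit.get("sequence") else "="
    return "D" if from_len > to_len else "I"

def mapping_ops(mapping):
    """Return (length, op) pairs for one Mapping, merging adjacent equal ops."""
    ops = []
    for edit in mapping.get("edit", []):
        op = classify_edit(edit)
        if op == "D":
            length = int(edit.get("from_length", 0))
        else:
            length = int(edit.get("to_length", 0))
        if ops and ops[-1][1] == op:
            ops[-1] = (ops[-1][0] + length, op)
        else:
            ops.append((length, op))
    return ops

# Hypothetical mapping: 4 matches, 1 mismatch, 2 inserted bases, 3 deleted bases
mapping = {
    "position": {"node_id": "7"},
    "edit": [
        {"from_length": 4, "to_length": 4},
        {"from_length": 1, "to_length": 1, "sequence": "T"},
        {"to_length": 2, "sequence": "GG"},
        {"from_length": 3},
    ],
}
print(mapping_ops(mapping))
```

These operations are relative to the node the mapping sits on, not to any linear reference, so they still need to be placed on a path before they can become a SAM CIGAR.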

Chris7 commented 1 year ago

Thanks. To give a bit of context, we have been using graph alignment (vg in particular) to confirm strain engineering. We have a given strain and may carry out operations like large insertions, deletions, base pair changes, etc. These changes are all encoded in a graph that the reads are aligned against. By looking at the alignments over the parental vs. engineered regions we can verify what engineering was successful. The graph approach is incredibly useful, as with a linear aligner we would have to invent heuristics to avoid all the systematic artifacts introduced (I can go more into these if you really care to hear). However, bench scientists don't have a lot of intuition for the kinds of visuals that currently exist for graph aligners. One problem with surjection is that differences between the graph alignment and the surjected alignment cause confusion: read depth changes, and new variants are sometimes introduced by the new local alignment.

I managed to get what I need from the proto format (thank you very much!). Here's an example of what I have: [Screenshot from 2023-09-05 07-19-27]

Interface-wise, the command is invoked with a GAM plus a path/walk ID to filter by. If a segment is not present in that reference path, the CIGAR string is represented accordingly (for example, we have an insert here, so the missing segment gets an I; if we had deleted it, it would get a D, etc.).
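The tool itself is not shown in this thread, so the following is only a node-level sketch of the idea, under simplifying assumptions: every node is traversed in full on the forward strand, each node appears at most once on the reference path, and within-node edits are ignored. The function `project_walk` and its inputs are hypothetical.

```python
def project_walk(read_walk, ref_path):
    """Project a read's node walk onto a reference path as a CIGAR string.

    read_walk: list of (node_id, aligned_read_bases) in read order
    ref_path:  list of (node_id, node_length) in path order
    """
    ref_index = {node_id: i for i, (node_id, _) in enumerate(ref_path)}
    ops = []  # [length, op] entries, with adjacent equal ops merged

    def emit(length, op):
        if length <= 0:
            return
        if ops and ops[-1][1] == op:
            ops[-1][0] += length
        else:
            ops.append([length, op])

    next_ref = None  # index of the next expected reference node
    for node_id, bases in read_walk:
        i = ref_index.get(node_id)
        if i is None or (next_ref is not None and i < next_ref):
            # Node is off the reference path (or out of order): insertion
            emit(bases, "I")
            continue
        if next_ref is not None:
            # Reference nodes skipped by the read become a deletion
            emit(sum(length for _, length in ref_path[next_ref:i]), "D")
        emit(bases, "M")
        next_ref = i + 1
    return "".join(f"{n}{op}" for n, op in ops)

ref_path = [(1, 10), (2, 5), (4, 8)]
print(project_walk([(1, 10), (3, 6), (2, 5)], ref_path))  # read carries an inserted node
print(project_walk([(1, 10), (4, 8)], ref_path))          # read skips node 2
print(project_walk([(1, 10), (3, 6), (4, 8)], ref_path))  # node 3 replaces node 2
```

A real implementation would also need the start offset on the path, reverse-strand handling, and the per-mapping edits from the GAM record.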

This is ok to close, but I thought I'd drop some information on what we're using graph alignment for since it's a bit different than what most literature covers (pangenomes).

adamnovak commented 10 months ago

We could just do a projection of the GAM alignment onto the reference path and save that as SAM. Where the read visits a base on the target path would be a match, and where it visits any other node would be... an insertion adjacent to a deletion, or something.

But almost everyone who wants a SAM actually wants a SAM expressing a plausible alignment of the read against the reference used for the SAM. And if you want to know how the read falls in the graph, you can read the GAM or (probably better) read the more standard GAF text format.

We can think about adding the feature. @Chris7 are you still having to generate your unusual-semantics SAM to get the visualizations you want?

Chris7 commented 10 months ago

I still generate them, they're very useful :). I think the part here:

But almost everyone who wants a SAM actually wants a SAM expressing a plausible alignment of the read against the reference used for the SAM

shows the issue with the realignment approach surjection takes. We are asking to see the alignment over a common, shared region of a reference vs. variant genome and not to be realigned without knowledge of that alternative path.

I think there are 2 problems here rolled into 1 command:

Graphs are also really new to most of the community, so this command is quite useful when I try to explain to end users what they are seeing. It's worth mentioning that the use here is in synthetic biology, where we are using graphs to confirm/debug engineering in ways linear aligners just cannot do. So it's understandable if this workflow seems very bizarre to you. If you want to learn more about it, it could be interesting to chat (drop a line at chris@ginkgobioworks.com if you do want to! Maybe we can identify parts of the internal tooling we've been using to process this data that we could open source).