Best method of finding reads aligned to non-reference path

JosephLalli commented 1 year ago

I'd like to extract reads from a .gam file that do not align to any nodes that are in a reference path. My end goal is to identify coverage and depth of likely insertions.

Commands that strike me as potentially helpful (though I can't find enough documentation to be sure): vg call? vg paths? vg pack? vg chunk? vg filter? vg view -> .gfa graph and use cut/grep to exclude reference nodes? vg view --snarls?

If it is simpler to extract reads attached to known paths, I can first use Pangenie to identify insertions that are likely present in my dataset. One possible workflow is Pangenie -> identify called insertions -> extract insertion sequences and reads aligning to that sequence -> format into bam file of insertions.

Any thoughts? This strikes me as a problem that someone has likely already solved - I just don't know who!

glennhickey commented 1 year ago

This is a good question . vg call will give you coverage of non-reference paths (in VCF format), but not the supporting reads themselves.

But you can extract reads from a GAM that do not align to any reference nodes with vg annotate and vg filter as follows:

Add reference positions to the GAM:

vg annotate -x <graph.vg> -a <aln.gam> -m > annotated.gam

Remove any read that aligns to a reference position from the GAM:

vg filter annotated.gam -X <reference contig regex>

JosephLalli commented 1 year ago

Thank you @glennhickey, that works perfectly!

I have a related question for you: When I surject a GAM file (for example, against, hg38), what happens to the reads whose alignments do not overlap a single hg38 node? They are declared "unmapped" in the output bam file, correct?

My ultimate goal is to capture these reads in the surjection output. I can run vg giraffe -> vg pack -> vg call and get a vcf of non-reference paths that reads in my sample are likely to have originated from. I can extract the paths themselves - the list of nodes and directions (for example, >3719088>3719089>3719090>3719091>3719092>3719094>3719095>3719096>3719098>3719099>3719100). Can I include these paths as surjection targets during vg surject?

jeizenga commented 1 year ago

You are correct that those reads would be unmapped by vg surject. I think I'm not totally clear what behavior you'd want for this use case. The SAM/BAM format requires that the alignment be reported against a named sequence, but these short non-reference subpaths don't really have a name. If you wanted to introduce them as named sequences, one way to do so would be to manually append them as P lines on the GFA and then make a new XG index with vg convert.

JosephLalli commented 1 year ago

You've got it basically right, @jeizenga.

Context For a bit more context, I'd like to be able to use traditional, linear genome based tools with my pangenome mappings.

From what I can tell, surjection works great for this purpose. (Note: I can't find great documentation on the surjection algorithm, so I might have this totally wrong.) Surjection preserves indels that are less than one read length long, and preserves large deletions in your dataset. However, surjection cannot preserve insertions larger than one read length.

The solution I'm trying to develop is to call these insertions with vg call or Pangenie (probably vg call until there's a good CHM13v2 reference graph for Pangenie). I would then like to include these insertions in the reference graph as alternative contigs that reads can be surjected against. The end product would be a bam file containing the normal 1-22, X, Y and MT contigs, along with alignments to those contigs. The bam file would then go on to contain a few thousand insertion contigs of lengths > 150bp, along with reads that align to those insertions. These contigs would be named with the insertion's start and end coordinates.

If I understand everything correctly, that would allow for linear reference based tools to call variants and assess mapping quality in these insertions. It would also prevent 200X pileups of reads in reference regions that are duplicated in the sample.

In the end, I would like to create a tools that places these insertions back into the reference genome, creating a personalized diploid genome.

Specific question For the -p option in surject, what exactly is the unit I'm providing to vg surject? The name of a contig? A path? A walk? A thread? Is there a good definition of how vg uses these terms?

jeizenga commented 1 year ago

VG's graph data model allows any path to be included as an annotation to the graph. Most often, these paths are reference contigs/scaffolds. However, we do use the named path system for some other sequences as well (e.g. transcript paths through a spliced graph). In this data model, the path consists of a name and a sequence of nodes. The names are unique, and they are used as identifiers for the paths.

For your pipeline, you would first need to add these sequences as named paths (using any unique name). After that, you can use the -p or -F option to identify them as paths that you want to report alignments against.

The closest thing to a publication for the vg surject algorithm is probably chapter 4 of my dissertation.

JosephLalli commented 1 year ago

OK. Is there a good method of adding a path (as defined by a sequence of nodes) to a preexisting graph? Ideally one that doesn't require going through a gfa graph?

(I could always add the path manually to a gfa, but I've found that, in my hands, VG often crashes when working with gfa graphs.)

jeizenga commented 1 year ago

I don't think we've added that functionality anywhere, unfortunately. VG should at least play well with the GFAs that it produces, so you probably can't go too wrong with something like this:

vg convert -f graph.vg | cat - new_p_lines.txt | vg convert -g - > updated_graph.vg

JosephLalli commented 1 year ago

Working on this. One hiccup - vg call produces oriented lists of nodes in the following format:

>1>2<3>4

Whereas vg convert expects a path to contain a list of nodes in the following format:

1+,2+,3-,4+

My question: I think for first kind (I think a that's a walk?), the orientation is indicated prior to the node id. So, in the example, nodes 1, 2, and 4 are oriented in the forward direction, while 3 is oriented in the reverse direction. For the second kind (I think that's a path?), the orientation is indicated after the node id. I have tried to write it out so that, as above, nodes 1, 2, and 4 are oriented in the forward direction while node 3 is oriented in the reverse direction. Do I have the logic of these paths right? I can convert from one format to the other, but I want to double check that I'm correctly understanding the notation system.

glennhickey commented 1 year ago

The definitions of the two walk formats are here: https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md (P-lines vs W-lines). vg can (in general) handle both.

You can also add a path(s) directly to the graph from GAF (or GAM without -F) with vg augment -B -F. The GAF format will have paths that look a lot like vg call: https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf

Calling within large insertions is fairly high on our agenda for the next HPRC release. A lot of the components are in vg, but as you've seen, there's still some work to put it all together. The idea so far is to use something like rGFA to assign unique coordinates to all the off-reference sequence and use those for downstream analysis. So please feel free to keep in touch about this (and look for related updates in upcoming vg releases this year).

vgteam / vg

Best method of finding reads aligned to non-reference path #3816