vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

Is there a way to figure out which read pairs are "properly paired" from JSON or GAF file? #3217

Closed ac2278 closed 3 years ago

ac2278 commented 3 years ago

A "properly paired" read pair is a pair that aligns with the expected relative mate orientation and with the expected range of distances between mates. Is there a way to figure out which read pairs are "properly paired" from a JSON file (first output) or a GAF file (second output)?

JSON output example: Screen Shot 2021-02-17 at 3 57 04 PM

GAF output example for one pair of reads: Col Type Description
1 string Query sequence name
2 int Query sequence length
3 int Query start (0-based; closed)
4 int Query end (0-based; open)
5 char Strand relative to the path: "+" or "-"
6 string Path matching /([><][^\s><]+(:\d+-\d+)?)+|([^\s><]+)/
7 int Path length
8 int Start position on the path (0-based)
9 int End position on the path (0-based)
10 int Number of residue matches
11 int Alignment block length
12 int Mapping quality (0-255; 255 for missing)

V1 "A00649:64:HLYN7DRXX:2:2142:7229:19774/1"
V2 "151"
V3 "0"
V4 "151"
V5 "+"
V6 "<43624457<43624456<43624455<43624454<43624453"
V7 "160"
V8 "6"
V9 "20"
V10 "130"
V11 "151"
V12 "60"
V13 "AS:i:87"
V14 "bq:Z:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFF" V15 "cs:Z::4AG:17TC:13CT:12CG:11TCAG:12GA:28AG:5CT:13AG:1TC:8TC:6+AAGGAGCCG"
V16 "dv:f:0.1391"
V17 "fn:Z:A00649:64:HLYN7DRXX:2:2142:7229:19774/2"

V1 "A00649:64:HLYN7DRXX:2:2142:7229:19774/2"
V2 "150"
V3 "0"
V4 "150"
V5 "+"
V6 ">43624446>43624447>43624448>43624449>43624450>43624451"
V7 "192"
V8 "28"
V9 "16"
V10 "139"
V11 "150"
V12 "60"
V13 "AS:i:106"
V14 "bq:Z:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF" V15 "cs:Z::26AG:54+GC:3CT:2TC:2CG:25CT:2CT:3AG:2CT:19*TC:1"
V16 "dv:f:0.0733"
V17 "fp:Z:A00649:64:HLYN7DRXX:2:2142:7229:19774/1"

jeizenga commented 3 years ago

If you're aligning with either vg map or vg mpmap, we added an annotation field to the output that will indicate if the mappings are properly paired. You can also count these efficiently from a GAM file using vg stats -a. If you're aligning with vg giraffe, the output doesn't annotate properly pairing yet, but one of our devs is going to add it soon. Let me know if you have any further questions!

ac2278 commented 3 years ago

Wow, thanks a bunch, @jeizenga! Are these changes in the latest release (vg 1.31.0 - Caffaraccia) or will I need to clone the vg repository then build from source?

jeizenga commented 3 years ago

Unfortunately, it didn't quite make it into the last release, so you'll need to build from source.

ac2278 commented 3 years ago

Thanks, @jeizenga. I have one question regarding the new "properly paired" annotation flag:

After I build from source, if I "vg map" reads against my graph to get a GAM file then 'vg surject' the alignments back into the reference space to get a BAM file, I can run "samtools flagstat" on the BAM files to see how many read pairs were properly paired (i.e., the properly paired entry will no longer be NA; see below)?

Screen Shot 2021-04-06 at 3 14 31 PM

jeizenga commented 3 years ago

As it stands, I don't think that would work. The GAM will have the pairing annotated, but vg surject doesn't preserve the proper pairing status.

It seems to me that there's a semantic ambiguity here, because the read may be properly paired with respect to the graph but not properly paired with respect to the surjection path. For instance, the graph might contain an inversion, and two ends of the same fragment might align to opposite strands of the path sequence. What behavior would you be looking for in this case?

ac2278 commented 3 years ago

@jeizenga, I'm not looking for anything specific. I constructed a graph using 1 linear reference genome and a multi-sample VCF file. I then aligned reads to produce a GAM. I want to convert the GAM file into a BAM file because I'd like to use samtools to remove potential PCR duplicates, and the only way to do this (from what I can tell) is to use the surject command. Is this the wrong way to go about this? Do I lose anything by projecting graph alignments onto the linear reference?

jeizenga commented 3 years ago

Projecting down to a linear reference is a lossy conversion, and there's not really any way around that fact. If you convert from a more expressive formalism to a less expressive formalism, you risk losing information. However, the vg surject algorithm is designed to preserve as much of the graph alignment as possible, even if that means making alignments that seem suboptimal when looking only at the linear reference. For instance, if the graph alignment spans a large deletion, the alignment will preserve that deletion even though it gives the alignment a very low score.

ac2278 commented 3 years ago

Thanks, @jeizenga. This is where I get confused. If my ultimate goal is to to get a matrix of called genotypes that I can use as input for a principal component analysis, GWAS, etc., doesn't that mean that, at some point in the process (the steps between aligning reads to a genome graph and calling genotypes and genotype imputation), I need to project down to a linear reference? If I call variants and generate a VCF file, is vg not using the linear reference as the coordinate system (to define the physical position of variants)? Maybe I'm not understanding what "projecting" means in this context.

jeizenga commented 3 years ago

For some of these analyses, it might not actually be important to have a linear coordinate system. In particular, PCA is invariant to permuting the rows/columns of the matrix, so any ordering will work fine. For GWAS, some idea of ordering is sometimes important if you want to account for linkage.

There aren't really any widely agreed-upon ordering formalisms or file formats for describing variants in the graph space. In contrast, the genotype matrix formalism is pretty standardized, and the VCF format is well supported by lots of tools. I expect there will be a lot of research on graph-based formats in the next few years, but--unless you want to be on the bleeding edge of pangenomics methods research--for now you need to be working in VCF.

So, to answer your question: you are right that it is necessary to project down to a linear reference. The real question is "when" not "if". Specifically, do you want to project before or after calling variants. vg surject projects each read alignment down to the linear reference so that it can be expressed as SAM/BAM. At that point, you could use linear reference-based variant calling tools (DeepVariant, GATK, etc.). On the other hand, vg call requires graph alignments that can be expressed in GAF or GAM format, which can be produced by the vg mapping tools (map, mpmap, and giraffe). However, output of vg call is VCF, which is a linear reference-based format.