vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

Problems reproducing experiment 2(a) in the paper #1873

Open amit-jain-SBG opened 6 years ago

amit-jain-SBG commented 6 years ago

We received the simulated reads from the authors. However, it is not clear how to

a) extract Mapping Quality of the alignments,

b) derive the equivalent reference position of the reads that were simulated from graph nodes (needed for comparison to BWA etc)

adamnovak commented 6 years ago

Mapping quality can be extracted from a GAM by converting the GAM to JSON with vj view -aj whatever.gam; the .mapping_quality field in the resulting JSON gives the mapping quality of each read, and the absence of the field signifies mapping quality 0.

This is most useful when used with the jq tool. For example: to get a TSV of read names and mapping qualities, you can do:

vg view whatever.gam | jq -r '[.name, if .mapping_quality then .mapping_quality else 0 end] | @tsv' > whatever.tsv

For determining the effective linear position of simulated reads, you can first check if the reads have their .refpos field set in the JSON from vg view -aj; if you are looking at a simulated read truth set from the toil-vg scripts it should be set already.

If it's not set, you can annotate the reads with their positions on the paths in an XG index with:

vg annotate -a whatever.gam -x whatever.xg -p >annotated.gam

Then you can extract a TSV of read name and a pair of path name and offset columns for each path the read is on with:

vg view -aj annotated.gam | jq -r '[.name] + if .refpos != null then [.refpos[] | .name, if .offset != null then .offset else 0 end] else [] end | @tsv' > positions.tsv

You can also use the position-annotated GAM file as input to tools like vg gamcompare, which use the stored .refpos values to determine if reads have been correctly placed relative to a truth set.

If instead you want to do a BAM-to-BAM comparison against BWA, you can use vg surject to convert from GAM to BAM.

ekg commented 6 years ago

This thread moved here from email. I'll try to document what's been exchanged so far.

@adamnovak I think the problem here is @amit-jain-SBG is working with a simulated (not mapped) GAM, and so there is no mapping quality field. It shows the "true" alignment through the graph it was simulated from. To reproduce the results @amit-jain-SBG should be able to re-run the vg mapping using the same graph structure or set of variants that they're using as input to their method.

Here is the first part of the exchange:

I wanted to take the opportunity to document the generation of the simulated data. I'd like to make sure there is a script or at least wiki page describing it exactly. The simulation was unusual because we wanted it to be from the actual haplotypes of HG002, but needed also to have reference-relative positions. This was a natural application of the graph model, but until the last few iterations of the manuscript it wasn't completely stabilized and so it was never scripted up.

The plots in the paper were generated by this script: https://github.com/vgteam/vg/blob/master/scripts/map-sim

usage: map-sim [output-dir] [fasta-ref] [vg-ref] [vg-pan] [hap0-base] [hap1-base] [sim-base] [sim-ref] [threads] [sim-read-spec] [sim-seed] [bp-threshold] [vg-map-opts]

This script uses a lot of different inputs. The reason I'm not advocating you use it to reproduce the results is that I don't have a unified script to generate them:

[output-dir] : where to write the results and ROC plots
[fasta-ref] : reference FASTA file
[vg-ref] : linear graph in vg format, indexed for mapping
[vg-pan] : same for the pangenome we're testing
[hap0-base] : indexed variation graph of the first haploid genome of the sample subset from [sim-base]
[hap1-base] : same for the second haploid genome (subset from [sim-base])
[sim-base] : the graph from which these were derived (including the reference genome paths)
[sim-ref] : the linear subset of [sim-base] matching only the reference genome path
[threads] : how many CPUs to use during mapping with bwa and vg
[sim-read-spec] : the command line passed to `vg sim` for simulating the reads from [hap0-base] and [hap1-base]
[sim-seed] : the seed given to the simulator
[bp-threshold] : an alignment is counted as correct if it maps this far from the true simulated alignment
[vg-map-opts] : options given to vg during mapping, could be blank

As input I used the GiAB result HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz. I built [sim-base] from this and GRCh37, then used GBWT/xg index and vg paths to extract the haplotypes of HG002 into [hap0-base] and [hap1-base]. (The exact technique is quirky and also requires use of vg mod for graph subsetting.) I similarly extracted the reference genome paths from [sim-base] into the linear [sim-ref].

The subsetting (rather than rebuilding) ensured that the graph coordinates were the same in each case, ensuring that the vg annotate calls based on the simulated graph alignments (in GAM) were sensible. [hap0-base] and [hap1-base] are used to simulate the reads. But they must be annotated with the reference genome positions, for which we use [sim-base] and vg annotate -p. We also want to know which subset of the reads map exactly to the reference and which touch non-reference variation, and to do this we use [sim-ref] and vg annotate -n.

Beyond this I think the map-sim script should be enough to understand how the simulation worked. I would like to generate a full script to do this but it will take several days and it will not be possible for to focus on this for at least a few more weeks.

This is how the script was run for the results in the paper, varying ~/graphs/human/pan.AF0/wg to change the pangenome graph we used:

~/graphs/vg/scripts/map-sim HG002.10M.wg.9 \
    ~/graphs/hs37d5.fa \
    ~/graphs/human/ref/wg \
    ~/graphs/human/pan.AF0/wg \
    HG002.thread0 \
    HG002.thread1 \
    HG002 \
    HG002.ref \
    32 \
    "-n 2500000 -e 0.01 -i 0.002 -l 150 -p 500 -v 50" \
    2358792 \
    150 \
    " "
ekg commented 6 years ago

This responds to the particular issue raised in this thread:

Note that I didn't provide the ROC curves, but the truth set. This includes the reads (in fastq) and a table describing their mapping positions and the number of bases they have in the reference genome and in non-reference bases. You will need to build the reference graph from the same set of variants that you've chosen for your experiments and reproduce the alignments to build the correct ROC.

The sim.gam.truth.tsv header should be:

read.name, chr, pos, length.bp, unaligned.bp, known.nodes, known.bp, novel.nodes, novel.bp

"known" refers to things in the linear reference, "novel" refers to things not in the linear reference. The idea is that we need to be able to separate the reads which are exact matches to the reference from those which include non-reference variation. Recall the three panels from figure 2a in the paper. The left has all reads, the middle has only the ones exactly matching the reference, and the right has only those with some non-reference variation in them. These annotations were used to generate this subdivision.

You can reproduce the annotation pipeline that made sim.gam.truth.tsv on a small example using vg:

cd ~/vg/test # in the vg repo test directory
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 32 -p >z.vg # construct a graph
vg mod -k z z.vg >z.ref.vg  # keep only the reference subset of the graph in another file
vg index -x z.xg z.vg # index the base graph
vg index -x z.ref.xg z.ref.vg # index the ref-only graph
vg sim -n 100 -l 150 -x z.xg -s 1337 -a | vg annotate -n -x z.ref.xg -a - | head # use "novelty annotation" to calculate the non-reference component of the simulated reads

This is then merged with the positional information from the simulation, see https://github.com/vgteam/vg/blob/master/scripts/map-sim#L53.

Let me know if this helps. I still haven't finished my thesis and I can't commit to scripting out the full pipeline (from 1000GP and phased GiAB calls to ROCs) yet, but I do think it's important, so that will be forthcoming. It's good to have this thread going to document this process.

amit-jain-SBG commented 6 years ago

Thanks @ekg , for the comprehensive explanation.

One question: is it correct to say that you simulate reads from the entire genome, but outside high confidence they should be all from linear reference (this is how I understand the statement "As input I used the GiAB result HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz.”)? I ask because I am wondering whether Is it fair to trust the reference haplotype outside high confidence region.

I'm sure we will have more question as we understand and replicate the process better :)

ekg commented 5 years ago

@jeizenga this is a test case that I've set up to help us validate the simulation pipeline.

HG002_1kgp_chr20_0_2M.tar.gz

@amit-jain-SBG we're working on a CWL pipeline to reproduce this from the source VCFs (for 1000G and HG002) and FASTA.

ekg commented 5 years ago

Here's (approximately) the script I used to generate the haplotype specific graphs for the simulation:

#!/bin/bash

for chrom in $(seq 1 22); do time vg construct -t 32 -R $chrom -C -r ~/graphs/hs37d5.fa -v HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz -m 32 -p -a | vg ids -s - >HG002.$chrom.vg; done
vg ids -j $(for chrom in $(seq 1 22); do echo -n HG002.$chrom.vg" "; done) && cat $(for chrom in $(seq 1 22); do echo -n HG002.$chrom.vg" "; done) >HG002.vg && time vg index -p -x HG002.xg -v HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz -a HG002.vg
vg index -p -x HG002.xg -v HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz -a HG002.vg
vg find -x HG002.xg $(for chrom in $(seq 1 22); do echo -n -q _thread_HG002_${chrom}_0" "; done) | pv >HG002.thread0.vg
vg find -x HG002.xg $(for chrom in $(seq 1 22); do echo -n -q _thread_HG002_${chrom}_1" "; done) | pv >HG002.thread1.vg
vg index -x HG002.thread0.xg HG002.thread0.vg 
vg index -x HG002.thread1.xg HG002.thread1.vg 
jeizenga commented 5 years ago

Working on it here: https://github.com/jeizenga/reproduce_vg_results

ekg commented 5 years ago

@jeizenga we'll need to do something like this to get the paths from the GBWT:

vg paths -x x.xg -g x.gbwt -V -q _thread_1_x_0 >thread_1_only.vg
pesho-ivanov commented 5 years ago

@ekg, do I understand correctly that the map-sim script for the ROC curves considers an alignment correct if it starts near the simulation truth (<150bp)? Have you checked the accuracy drop if you count only the exact-start matches, or even more precisely, the whole path equivalence?

ekg commented 5 years ago

Yes, that's correct. I haven't checked recently the difference between this and perfect match and whole path equivalence (which we can't easily establish for bwa mem). The idea here is that many variant calling approaches will collect read data "near" a locus and use that to call variants, where near is a small multiple of the read length. I would expect the numbers to be slightly worse in terms of FPR and TPR for all methods, because every soft clipped read would be considered a failure.

cademirch commented 5 years ago

@ekg Am I correct in understanding that the [vg-pan] and [vg-ref] are constructed separately from [sim-base], but using the same reference, thus they have the same reference coordinates? Could you also expand on the index construction and pruning of the [vg-pan]? Specifically, I am curious about the options creating the GBWT index.

I am trying to recreate this experiment, albeit with a different dataset and this information would be very helpful.

Thanks in advance.