Method for selecting variant truthset

Generating truth VCFs

There is three main ways we can do this.

Simulate variants on the reference assembly
Select a close-ish strain, align the reference assembly to that, and get variants from that alignment
Same as Option 2, but we apply the variants to our sample's assembly and call variants with respec to that mutated reference

Option 1 has the advantage of giving us absolute control over what variants to simulate, at what rate, what the truth variants are etc. The disadvantage is that this doesn't really simulate "real" mutational processes - e.g., genes mutating at different rates based on function and compared to intergenic regions. One tool, mutation-simulator does allow for more fine-grained simulations, but we would need to determine what the mutation rates are for each species etc.

Option 2 has the advantage of giving us more natural mutational processes. But had the downside of being a little more complicated to extract true variants, plus having to determine regions of the genome to ignore (due to unmapped regions or high variability etc.)

Option 3 is a kind of hybrid of Options 1 and 2. It has the advantage of removing ambiguity around what our truthset of variants is, while using variants that really do occur between two strains. However, the downside is you're removing some of the challenge caused by the distance between the two strains - it's still a bit artificial.

We all agree that Option 1 is a no go.

Selecting the VCF reference

The VCF reference (vcfref) would be a different strain from the same species. One way I have played with doing this selection is to download all complete genome assemblies from refseq

genome_updater.sh -d "refseq" -g "bacteria" -f "genomic.fna.gz" -o GTDB_species_Vibrio_parahaemolyticus -M "gtdb" -T "s__Vibrio parahaemolyticus" -m -a -t 8 -l "complete genome"

Create a Mash sketch of those genomes

mash sketch -p 8 -o vpara -s 10000 GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/*.gz

And compute the mash distance to the reference assembly, sorting by the distance column

$ mash dist -p 8 -s 10000 vpara.msh ../../data/references/ATCC_17802__202309.fa | sort -g -k3 | head
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_001558495.2_ASM155849v2_genomic.fna.gz       ../../data/references/ATCC_17802__202309.fa     2.38113e-06     0       9999/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_021730005.1_ASM2173000v1_genomic.fna.gz      ../../data/references/ATCC_17802__202309.fa     0.000388021     0       9839/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_021730065.1_ASM2173006v1_genomic.fna.gz      ../../data/references/ATCC_17802__202309.fa     0.000388021     0       9839/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_019321785.1_ASM1932178v1_genomic.fna.gz      ../../data/references/ATCC_17802__202309.fa     0.00483188      0       8240/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_013393885.1_ASM1339388v1_genomic.fna.gz      ../../data/references/ATCC_17802__202309.fa     0.0145334       0       5835/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_000430425.1_ASM43042v1_genomic.fna.gz        ../../data/references/ATCC_17802__202309.fa     0.0145695       0       5828/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_001244315.1_ASM124431v1_genomic.fna.gz       ../../data/references/ATCC_17802__202309.fa     0.0147041       0       5802/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_001887055.1_ASM188705v1_genomic.fna.gz       ../../data/references/ATCC_17802__202309.fa     0.0147041       0       5802/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_018135645.1_ASM1813564v1_genomic.fna.gz      ../../data/references/ATCC_17802__202309.fa     0.0147874       0       5786/10000
GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_009764075.1_ASM976407v1_genomic.fna.gz       ../../data/references/ATCC_17802__202309.fa     0.0148448       0       5775/10000

We then select a genome which has some distance from our original - preferably not the lowest distance. The mash distance approximates the average nucleotide identity (actually it is 1-ANI), so we could set an ANI we would like, such as 0.5% and then selecting a genome with a distance close to that.

Generating truth variants, targets, and mask

With a vcfref selected we now need to figure out what variants exist between the two genomes, what regions we want to target, and what regions we want to mask.

A very simple method here would be just use dnadiff between the two genomes and use the differences as the true variants. However, this will possibly miss variants, or even produce false positives. The way Martin approached this in varifier was to use dnadiff and minimap2 to produce two seperate sets of variants. He then makes probes out of these and maps them back to the reference, requiring perfect matching. If they don't perfectly match, he discards the variant.

This approach for truth variant generation is my preference, however, we need to take this a step or two further. Those false positive variants whose probes don't align should be output as a type of mask - i.e., we should not evaluate variants at these positions because they obviously have problems. We additionally need to identify whih=ch regions of the vcfref we allow variants in (targets). To elaborate, the vcfref is a different strain and so there will be parts of its genome which do not align with the reference assembly. We don't want to assess variants in these regions, just the regions that align. We can take this a step even further and remove regions from the targets that do not have depth 1 when aligning the vcfref and reference assembly as these are either repetitive regions, highly divered regions, or regions that exist in one genome and not the other.

A way of identifying these target regions would be to align the two genomes using asm5 (or similar) and piping this into samtools depth and keeping only positions with 1 in column three. Here is a way of counting the number of different depths in an alignment

$ minimap2 -x asm5 -ac --cs $ref ../../data/references/ATCC_17802__202309.fa | samtools sort -o vpara.bam
$ samtools depth -aa --reference $ref vpara.bam | cut -f3 | sort | uniq -c
 723066 0
4688603 1
  15337 2

then to extract these to a bcftools-compatible targets file

samtools depth -aa --reference $ref vpara.bam | awk -F'\t' '$3==1' | cut -f1,2 > vpara.targets

from the bcftools docs

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file can contain either positions (two-column format: CHROM, POS) or intervals (three-column format: CHROM, BEG, END), but not both. Positions are 1-based and inclusive. The columns of the tab-delimited BED file are also CHROM, POS and END (trailing columns are ignored), but coordinates are 0-based, half-open.

The above command will generate the tab-delimited (default) file used by bcftools. We probably should convert this to BED though for more versatility.

We can then subtract any masked regions from this file, or keep the two separate and use bcftools filter to keep targets and remove masked.

Reasons for going with Option 3

Here we document our reasoning for deciding to go with the hybrid method for truthset generation.

Initially, we had wanted to take a reference from another strain (VCFREF), align our sample's assembly (REF) to it and then take the set of variants between them as the truthset when calling variants with respect to VCFREF.

To do this, we align VCFREF and REF using both dnadiff and minimap2 with the intention of taking the variants in common between the two. We use syri to assess the alignments and pull out the variants. This has the added advantage of also identifyin structural variation between the two genomes, and facilitates visualisation of the alignment of the genomes.

Here is how the alignment and syri were run:

threads=2
vcfref=/data/scratch/projects/punim2009/NanoVarBench/tmp/GTDB_species_Vibrio_parahaemolyticus/2024-01-15_14-03-02/files/GCF_019321785.1_ASM1932178v1_genomic.fna.gz
ref=/data/scratch/projects/punim2009/data/references/ATCC_17802__202309.fa

minimap2 -t $threads --eqx -c --cs -x asm5 $vcfref $ref | sort -k6,6 -k8,8n > vpara.paf
fixchr -c vpara.paf -r $vcfref -q $ref -F P --contig_size 10000 --prefix vpara.

ref=$(realpath vpara.fixchr.qry.filtered.fa)
vcfref=$(realpath vpara.fixchr.ref.filtered.fa)

prefix='vpara.dnadiff'
delta="${prefix}.delta"
delta_filt="${prefix}.filtered.delta"
coords="${prefix}.filtered.coords"

nucmer --maxmatch -p $prefix $vcfref $ref
delta-filter -1 $delta > $delta_filt
show-coords -THrd $delta_filt > $coords

syri --nc $threads -c $coords -d $delta_filt -r $vcfref -q $ref --prefix "${prefix}."

prefix='vpara.mm2'
paf="${prefix}.paf"
minimap2 -t $threads --eqx -c --cs -x asm5 $vcfref $ref | sort -k6,6 -k8,8n > $paf
syri --nc $threads -c $paf -r $vcfref -q $ref --prefix "${prefix}." -F P

The first issue that arise is the disparity in the number of variants between the two alignment methods.

This is the syri summary of the dnadiff alignment

#Structural annotations
#Variation_type Count   Length_ref      Length_qry
Syntenic regions        21      4465551 4442403
Inversions      1       27540   27361
Translocations  49      626206  646012
Duplications (reference)        8       12408   -
Duplications (query)    9       -       1915
Not aligned (reference) 63      206715  -
Not aligned (query)     59      -       108500

#Sequence annotations
#Variation_type Count   Length_ref      Length_qry
SNPs    4447    4447    4447
Insertions      312     -       1095
Deletions       369     36850   -
Copygains       15      -       132923
Copylosses      18      138966  -
Highly diverged 52      130597  171447
Tandem repeats  3       980     919

And here is the minimap2 summary

#Structural annotations
#Variation_type Count   Length_ref      Length_qry
Syntenic regions        4       4543669 4496192
Inversions      1       27540   27361
Translocations  3       451172  477469
Duplications (reference)        0       0       -
Duplications (query)    1       -       325
Not aligned (reference) 3       237193  -
Not aligned (query)     5       -       149085

#Sequence annotations
#Variation_type Count   Length_ref      Length_qry
SNPs    413     413     413
Insertions      65      -       18115
Deletions       66      25812   -
Copygains       0       -       0
Copylosses      1       261     -
Highly diverged 107     1245193 1231784
Tandem repeats  0       0       0

Of particular concern is that dnadiff discovers and order of magnitude more SNPs abd indels than minimap2.

The other concern that comes about from this too is the differences between the two alignments from a structural perspective. e.g., dnadiff has 49 translocations compared to minimap2's 3. And these become apparent when we visualise the two alignments with plotsr

dnadiff alignment

plotsr --sr vpara.dnadiff.syri.out --genomes genomes.txt -o vpara.dnadiff.plotsr.png

vpara dnadiff plotsr

minimap2 alignment

plotsr --sr vpara.mm2.syri.out --genomes genomes.txt -o vpara.mm2.plotsr.png

vpara mm2 plotsr

Of particular concern from these plots is that the start and end of chromosome 1 on vcfref has noticeably different alignments from dnadiff and minimap2.

So, in the end, we elect to go with taking the union of the variants from dnadiff and minimap2 and applying them to REF to create MUTREF. We will then call variants with respect to MUTREF for the analysis.

I will update this issue with exactly how this process is done.

mbhall88 / NanoVarBench